Lunacy Unleashed

Notes from the field in the War on Spam

Akismet – Automattic Kismet

Last week I told you all about Automattic Spam Stopper, the new anti-spam solution for WordPress from Matt Mullenweg. There’s been some new news, and you’re going to hear it here first.

First off, the plugin has been renamed to Automattic Kismet, or Akismet for short.

Second, it now requires a WordPress.com API key, which you can find on your WordPress.com Profile page. (Click My Dashboard, then Profile.) If you don’t have a WordPress.com account, you won’t be able to use Akismet until you somehow finagle yourself an account. The fastest way is probably to use Flock. You don’t actually have to blog at WordPress.com to use Akismet; you just need the account to get the API key. You can use the API key at more than one blog, too.

Matt plans to have Akismet free for personal use, and charge “pro” bloggers $5 per month for the service. He’s defined pro bloggers as anyone making over $500 per month from their blogs. He also has a program set up for large enterprise installations, though I only know of one customer for that right now. However, anyone who participated in testing Akismet prior to today will be grandfathered in and have a free enterprise account forever.

Akismet is surprisingly effective at stopping spam. After having built a sufficiently large corpus of spam to draw from, it’s killing about 99.9% of incoming spam, and has a false positive rate of less than 0.1%. However, when the central service goes down, all comments go into the moderation queue. The service has had some downtime, and on the sites where I’ve been testing Akismet, I’ve had to watch the moderation queue fairly closely. Matt says he’s working on new, more reliable hosting for the service.
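The fail-safe described above (hold everything for moderation when the central service can’t be reached) can be sketched roughly like this. It’s an illustration in Python, not Akismet’s actual code; the `check` callable and the three status names are my own assumptions.

```python
from typing import Callable


def triage_comment(comment: dict, check: Callable[[dict], bool]) -> str:
    """Return 'spam', 'approved', or 'moderation' for a comment.

    `check` is a hypothetical stand-in for the call to the central
    service: it returns True for spam and raises OSError (or a subclass
    such as ConnectionError) when the service cannot be reached.
    """
    try:
        is_spam = check(comment)
    except OSError:
        # Service unreachable: neither publish nor discard; hold for a human.
        return "moderation"
    return "spam" if is_spam else "approved"


def service_down(comment: dict) -> bool:
    """Simulates an outage of the central service."""
    raise ConnectionError("central service unreachable")
```

The point of the design is that an outage never lets spam through and never silently throws away a real comment; the cost is a busier moderation queue, which is exactly what I’ve been seeing.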

So where does Akismet fit into the overall spam prevention picture?

Akismet has a great advantage over most anti-spam solutions: by seeing incoming spam from all over the Internet, it can identify new spam very quickly, perhaps as soon as seconds after a spam run begins, once it’s in wider usage. It is also better at spam management. When you have to sort through hundreds of spams to find a legitimate comment that might have been blocked by mistake, it presents the spam in a compact format that makes it pretty easy to scan through and spot legitimate comments.

However, Akismet has a couple of drawbacks which are common to most anti-spam solutions for WordPress, and a couple of unique drawbacks of its own. The obvious ones are that it’s a for-pay solution for many people who might want to use it. It uses a central server which is subject to downtime. Though Matt hasn’t said much about the secret sauce, it definitely analyzes the content of incoming posts. And finally, it does nothing to keep the spammers from using up your bandwidth and database space.

For most people running a personal WordPress blog, Akismet is the ideal second line of defense. It will entirely replace plugins such as wp-hashcash, Spam Karma 2, AuthImage, etc. In fact, it makes most other anti-spam plugins entirely redundant.

The one anti-spam plugin which Akismet will not make redundant is Bad Behavior. There are several reasons for this. Bad Behavior is a first line of defense, stopping spammers before they can read your site at all, waste your bandwidth, or drop junk in your database. This is especially important for self-hosted sites, or sites hosted on dedicated or virtual dedicated servers, where CPU time and bandwidth are precious. Like most other anti-spam plugins, Akismet does not and cannot protect its users’ bandwidth, CPU and disk space from a spam attack. Bad Behavior does, meaning it will continue to be an integral part of most people’s anti-spam arsenals.

You may not think this is important, especially if you have never received a large amount of spam at once. But the day is coming when you will, and having that first line of defense can mean the difference between your site staying up, and your Web host shutting off your site. Spammers can easily hit you so hard as to create denial-of-service conditions, and Bad Behavior has been proven to mitigate this effect. In fact, it’s even stood up to the Slashdot effect without blinking.

I should make a disclosure at this point. I am involved in the development of Akismet, having rewritten a significant amount of the code from the time it was known as ASS, and having integrated CJD’s Spam Nuker into the plugin. I will remain involved with Akismet as long as there’s work to do on it (and there are a couple of bugs I need to fix).

As I said yesterday, however, I remain committed to the development of Bad Behavior. It is still sorely needed as a first line of defense for WordPress, not to mention all of the other platforms on which it now runs.

What does the future hold? Nobody can say for sure, but I predict that for WordPress users wanting to remain spam-free, the combination of Akismet with Bad Behavior will prove to be a double whammy to blog spammers. For everyone else, Bad Behavior remains the first line of defense, and Matt has said that Akismet could be ported to other platforms as well. Someone else, I think, will have to take up that challenge. My hands are full already. 🙂

P.S. Matt’s started a web site for Akismet, where you can find more information.

October 26, 2005 Posted by | Akismet, Blog Spam, Spam, WordPress, WordPress.com | 15 Comments

Bad Behavior 2 Roadmap

Update: Bad Behavior 2 development is on hold indefinitely. Find out why and how you can help.

Yesterday I said I was beginning work on Bad Behavior 2.0, the next generation of the Web’s premier link spam killer. And I did. I wrote some ten lines of code.

Before I go into the roadmap, I have to digress a bit and explain something a lot of people may not be aware of.

Bad Behavior is open source software, released under the GNU General Public License, which you can find copies of all over the Internet, or included with the program. You don’t have to pay a cent to download or use it. However, developing it still costs me time and money, which is why it can go so long between minor releases. Unless (until) some cash comes in, it doesn’t get updated except in cases of dire emergency. Which only happens if I ship code with a typo in it, or Microsoft changes their search engine, or something like that.

I have hundreds of comments and trackback pings from users all over who have virtually eliminated their spam problems with Bad Behavior. And every so often, someone does click the nice PayPal button, to send a few bucks my way. Both are very much appreciated.

Killing blog spam has been mostly a labor of love, however, rather than cash, and as such, has to take a back seat to other more pressing concerns, like anything that generates revenue.

So what I’m going to do here is outline my roadmap for Bad Behavior 2.0, invite you to comment on it, and if you want to see it come about sooner rather than later, to vote with your dollars, pounds, euros, or whatever you have. The amount is blank, so fill in whatever you feel is appropriate.

And if you see any problems with it, or think it could be improved, you can comment on it as well.

First off, Bad Behavior needs to be even more modular than it is currently. Version 1 proved fairly easy to integrate into diverse PHP software packages with differing requirements for their plugins or modules, but it seems like each package requires something different. For version 2 I will have a structure put together to allow Bad Behavior to drop much more easily into packages such as DotClear and Geeklog, where the plugin architecture is quite different from everything else. This will also have the side effect of opening Bad Behavior to porting to even more software packages. This will be the largest design change in version 2.

Second, Bad Behavior needs to deal with the database more intelligently. In version 1, I kept a log of requests which had been denied, expanded it to optionally include all requests, and expanded it again to include the reasons for denial. Then I started using the information in the log to make decisions. Version 2 will feature a complete redesign of the database table, and expansion into two tables, one strictly for logging (for you to stare at), one strictly for making decisions. I expect to gain significant performance improvements thereby, as well as being able to make more intelligent decisions on which requests should be allowed and which should not be.
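To make the two-table idea concrete, here is a rough sketch using Python and SQLite. This is not the version 2 schema, which doesn’t exist yet; the table names, the columns, and the per-IP denial counter are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema: one table purely for logging, one purely for decisions.
SCHEMA = """
CREATE TABLE request_log (
    id         INTEGER PRIMARY KEY,
    ip         TEXT NOT NULL,
    user_agent TEXT,
    denied     INTEGER NOT NULL,
    reason     TEXT,
    logged_at  TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE decision_state (
    ip           TEXT PRIMARY KEY,
    denied_count INTEGER NOT NULL DEFAULT 0
);
"""


def record_request(db, ip, user_agent, denied, reason=None):
    """Log every request; update the small decision table only on denials."""
    db.execute(
        "INSERT INTO request_log (ip, user_agent, denied, reason) "
        "VALUES (?, ?, ?, ?)",
        (ip, user_agent, int(denied), reason),
    )
    if denied:
        db.execute(
            "INSERT OR IGNORE INTO decision_state (ip, denied_count) "
            "VALUES (?, 0)",
            (ip,),
        )
        db.execute(
            "UPDATE decision_state SET denied_count = denied_count + 1 "
            "WHERE ip = ?",
            (ip,),
        )


def denied_count(db, ip):
    """Decision lookups touch only the compact table, never the full log."""
    row = db.execute(
        "SELECT denied_count FROM decision_state WHERE ip = ?", (ip,)
    ).fetchone()
    return row[0] if row else 0
```

The performance argument is that the decision lookup hits a small table keyed on exactly what it needs, while the log can grow as large as you care to let it.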

Third, Bad Behavior’s API needs improvement. It started as a simple generic interface, and has already outgrown that interface. Version 2 will feature a completely redesigned API for integration into the host PHP program, offering more flexibility, and hopefully the ability for the host program to provide services to Bad Behavior, such as statistics and log viewing.

Fourth, that error page needs to be reworked. Most legitimate users unfortunate enough to see the page have no idea what to do, even though the page does provide suggestions. It needs to be shortened and clarified, and it needs to contain links to expanded information sources so that users can solve the problem on their own whenever possible. It should also customize the message based on the specific reasons for denial. The ideal, though, is that Bad Behavior should never present the page to a legitimate user, only to spammers.

Along those lines, the 412 error code will be changed. Bad Behavior 2 will attempt to deliver the most appropriate error code to the denial circumstance. In some circumstances, version 2 will return a 403 error. In others, perhaps a 412. In one case, a 417 is appropriate.
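As a sketch of what “most appropriate error code” might look like in practice, here is a small lookup in Python. The three status codes are the ones named above; the denial reasons are hypothetical, and so is my guess that the 417 case involves an Expect header.

```python
from http import HTTPStatus

# Hypothetical denial reasons; only the status codes come from the roadmap.
STATUS_FOR_REASON = {
    "blacklisted_user_agent": HTTPStatus.FORBIDDEN,              # 403
    "missing_required_header": HTTPStatus.PRECONDITION_FAILED,   # 412
    "unsupported_expect_header": HTTPStatus.EXPECTATION_FAILED,  # 417
}


def status_for_denial(reason: str) -> HTTPStatus:
    """Pick a status code for a denial, defaulting to 403 Forbidden."""
    return STATUS_FOR_REASON.get(reason, HTTPStatus.FORBIDDEN)
```

Falling back to 403 for anything unrecognized is a choice made for this sketch, not something the roadmap specifies.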

Fifth, Bad Behavior needs to provide better tools for site administrators to search for and eliminate any false positives that may arise. While version 1 contains whitelisting capability, it’s not easy for a site owner to determine why a particular request was blocked, due to being unable to find it in the logs. Version 2 will provide a unique key to each denied request which the site owner can use to immediately find the problem, if any, and take any necessary corrective action.
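The unique-key idea could work along these lines. This is only a sketch in Python: an in-memory dict stands in for the log table, and the function names and key format are assumptions, not a design commitment.

```python
import secrets


def deny_request(log: dict, ip: str, reason: str) -> str:
    """Record a denial and return a short key to show on the error page."""
    key = secrets.token_hex(8)  # 16 hex characters, random per denial
    log[key] = {"ip": ip, "reason": reason}
    return key


def explain_denial(log: dict, key: str):
    """Look up why a request was denied; None if the key is unknown."""
    entry = log.get(key)
    return entry["reason"] if entry else None
```

A blocked visitor quotes the key to the site owner, and the site owner goes straight to the matching log entry instead of digging through thousands of rows.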

Finally, Bad Behavior must continue to keep up with spammers as they attempt to adapt and find new ways to post their automated garbage. To date, this has been at most a minor issue, as there is only so much the spammers can do, while maintaining their high rates of spamming (10,000 or more posts in a single run is not unusual). Bad Behavior attempts to drive up the cost of link spamming, by blocking as many of those spammy requests as possible, forcing the spammers to resort to MUCH slower manual methods, or ideally, give up and find more honest work.

This is my vision for Bad Behavior 2. All things being as they are right now, the timeline for all this is anywhere from one to six months. How quickly it gets done depends on you.

Without any further contributions to Bad Behavior development, I’ll work on it in my limited free time, and it’ll take somewhere around six months. If I were to receive, for instance, $500 in contributions, I could devote a significant amount of time to it, and complete it within the next month. Hey, don’t laugh, that’s only a few cents per user.

If you think this roadmap looks good, and want to accelerate the development of Bad Behavior, contribute financially and I’ll be able to devote more time to it, meaning version 2 comes closer to reality sooner. And by all means, if you think I left something out that should be in version 2, please let me know. And yes, I know a lot of you are flat broke, so even if you are unable to contribute financially, leave a comment. Say hi, or suggest changes, or something, just so that I know you’re there and you think I should continue this project.

October 25, 2005 Posted by | Bad Behavior, Blog Spam, Spam, WordPress | 32 Comments

Bad Behavior 1.2.3


Bad Behavior 1.2.3, the latest release of the Web’s premier link spam killer, has been released.

From version 1.2.2, the following changes have been made:

  • Several additional spambots, mostly causing trackback spam, have been fingerprinted and blocked.
  • Bad Behavior 1.2.2 introduced a requirement that user agents present a User-Agent header. This caused quite a few unexpected things to break, and has been changed in this release. A User-Agent header will now be required only for POST requests.
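
The relaxed User-Agent rule in the second item boils down to something like the following. Bad Behavior itself is PHP; this is a Python illustration with a made-up function name, not the plugin’s code.

```python
def user_agent_check_passes(method: str, headers: dict) -> bool:
    """Require a non-empty User-Agent header only on POST requests."""
    if method.upper() != "POST":
        return True  # GET, HEAD and the rest are no longer held to this rule
    # Header names are case-insensitive, so compare them in lower case.
    user_agent = next(
        (value for name, value in headers.items()
         if name.lower() == "user-agent"),
        "",
    )
    return bool(user_agent.strip())
```

A GET with no User-Agent now passes this particular check, while a POST without one is still refused.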

There were a couple of other changes and enhancements I wanted to put in, but I decided to let them wait; I’ve begun work on version 2.0 of Bad Behavior, which will incorporate these changes. If you e-mailed me and didn’t see your change in here, this is why.

Anyways, it’s that time again, so download Bad Behavior now!

October 23, 2005 Posted by | Bad Behavior, Blog Spam, Spam, WordPress | 14 Comments

On stupid people

We see them every day, and usually we make fun of them.

They’re the stupid. The daft. The incompetent. The people who can’t seem to find their way out of a wet paper bag.

Some examples:

People looking for a job application for Target or Walmart (I misspelled that on purpose; if you’re really curious, ask me why, but not in comments) and can’t comprehend the simple fact that the applications simply are not online.

The people so high on methamphetamine they apparently didn’t realize they were smoking right in front of a police station.

The stupidity of government officials and the stupidity of phone companies.

I could go on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on and on. But I won’t.

The question of the day is: How do we make people less stupid? Is it even possible?

What’s wrong with our world, that so many people live their daily lives in a fog of stupidity?

October 20, 2005 Posted by | WordPress.com | 7 Comments

Want a link farm? How about some spam?

The following bit of spam arrived on my contact form last night. Nothing has been changed, because the guilty don’t need protecting here.

Jill wrote:
I have an offer for your business if you’re interested in increasing revenues each month. I’ll cut right to the chase. We’re looking for 1 of 2 things. Or both:

1. Allow us to place targeted advertising on your existing website. We would share any advertising revenues with you at an agreed upon percentage. We’re masters of online advertising, so we can probably unlock new cash flow for you that might have otherwise never been tapped.

2. Allow us to set up around 10 subdomains or subfolders off of your website–for example: http://www.subdomain.YOURURL.com OR http://www.YOURURL.com/subfolder These would contain sites we control and be on a variety of topics. You would have to switch the DNS info for these new subdomains over to one of our servers so that we can make changes to these sites. For this we would pay you a monthly fee that we both feel is fair.

Anyways, if you could get back to me as soon as possible it would be appreciated. We would like to make this a win/win situation! I Hope to hear from you soon!

Jill

jill@masterlinkservice.com

p.s. I do not want to waste any of your time. If you’re not interested please just delete the message and I will not contact you again. I feel the offer is a win/win however and that we can make lots of money together!

p.p.s. I hope to hear from you soon!

Website:
IP: 142.161.37.169

For the unfamiliar, I’ll explain these two ideas in some depth.

The first one sounds like a typical advertising campaign you might see on a blog, such as AdSense or BlogAds. Only this one is bound to contain ads you don’t want on your site, like online casinos or erectile dysfunction drugs. In the case of this company, I’m going to guess home improvement loans, based on some domains I caught this company involved with.

The second one is absolutely something you should never, ever do if you want to be found in a search engine. Companies which get control of a portion of your domain space in this way will typically do one or both of two things:

  1. They’ll post “free” articles on various topics, which also happen to be rather boilerplate, and appear at various other domains across the Internet. One example I caught this company doing involves http://www.homeloaninfobox.com and http://www.homeinus.com which contain exactly the same content, word-for-word. Search engines catch on to this sort of trick and lower both sites in their results.
  2. The more evil possibility is that of a link farm, pages with dozens or hundreds of links to various other sites, which contain dozens or hundreds of links to the same sites. Spammers want sites outside their own sites to link to them, so as to increase their legitimacy, and decrease the chance that their link farm will be caught. Google delists entire domains that it finds involved in link farms, and this is definitely not something you want to happen to you. It happened to Matt Mullenweg of WordPress. He thought it was a good idea at the time, but it turned out to be anything but.

There are good ways to make money on your blog, and there are bad ways. Those are two very bad ways.

October 18, 2005 Posted by | AdSense, Advertising, Blog Spam, Google, Link Farm, Spam, WordPress | 1 Comment

Wall Street Journal interview on blog spam

Last week a reporter from the Wall Street Journal emailed me to set up an interview for an upcoming article on the problem of blog spam. Now the first thing that came to mind is, “What does the Wall Street Journal care about blog spam?”

You may as well ask, what do they care about the Internet? Jeff Jarvis, who’s been around this particular block a few times, will tell you that “old media” needs to start caring about the Internet.

Virtually every company around now has an Internet presence, but these days that isn’t enough. Customers expect to be able to have an actual back-and-forth conversation with the companies they do business with online, and that’s where blogs come in. Companies that set up blogs now, and actually engage their customers online, are going to be the ones who survive the next big Internet shakeout. (See also “Web 2.0.”)

Anyway, the reason the Wall Street Journal cares is that its readers, typically corporate executives, investors, etc., either already care, or desperately need to start caring about blogs and blogging.

Which brings me to the interview, which thanks to some problems with my cell phone, didn’t actually take place. As of this writing the article hasn’t been published either, so maybe I’ll hear from him. Or maybe not.

Anyway.

Since you’re reading a blog, you probably already know something about blogging. I won’t bore you describing that. But you might not have seen blog spam.

Most blogs allow readers to leave comments and feedback on each individual entry, as well as trackbacks, which are automated notifications from other blogs sent when one blog references another. The comments and the trackbacks are then posted to the original posting. In this way is the blogosphere built.

At some point, spammers noticed this, and began developing automated methods of posting links to their own websites to blogs, using both the comment and trackback mechanisms. They have two goals in doing so: first and foremost, to drive traffic to their sites and income to their pockets, and second, to increase the search engine rank of their sites.

Early this year Google introduced a new standard called nofollow which blogs could apply to deny spammers search engine rank. Google, Yahoo and MSN all implemented the standard, as did most major blog, wiki and CMS platforms. But nofollow only addressed the secondary purpose, not the primary purpose, of blog spam, so nofollow hasn’t delivered on its promise to stop spam.
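In practice, applying nofollow means adding rel="nofollow" to the links in reader-submitted comments. Here is a minimal sketch in Python; the function name is made up, and a regular expression is a shortcut for illustration, not a robust way to process HTML.

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
ANCHOR_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to anchor tags that have no rel attribute yet."""
    def fix(match):
        attrs = match.group(1)
        if re.search(r"\brel\s*=", attrs, re.IGNORECASE):
            return match.group(0)  # leave an existing rel attribute alone
        return f'<a{attrs} rel="nofollow">'
    return ANCHOR_TAG.sub(fix, html)
```

Note what this does and doesn’t do: the link is still there for readers to click, so the spammer still gets traffic. Only the search engine credit is withheld.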

For WordPress, in the beginning, was Spam Karma. Spam Karma is a most excellent piece of software that does indeed block just about every piece of spam a blog might ever receive. It has one significant drawback, though: the spam sticks around, and you, the blogger, still have to deal with it. Spam Karma mails out digest e-mails with a summary of the spam caught, at least once per day. But get 50 or more, and you’ll get more than one e-mail. I’ve spoken to a blogger who received dozens of these e-mails daily, representing hundreds of spams being caught. Every day.

That’s a lot of work for anybody to do, let alone a blogger who just wants to write.

So, back in April, after the announcement of a WordPress plugin competition, I decided to do something that would stop spam. I had a completely novel idea which, as far as I could find, had never been tried before, and within a few days, had some working code. I tried it on a few guinea pigs, and it seemed good. And on the 24th April, the first release candidate of Bad Behavior went out the door.

It was a huge success, even far beyond my expectations.

Going in, I decided that it would be sufficient to stop most, if not all, spam, as long as there were absolutely no false positives, i.e. real people being blocked out. To this day Bad Behavior has kept this primary design goal.

It blocks between 90% and 99% of incoming spam before the blogger even has to think about spam, and on a very popular site, this is a lot of spam. Bad Behavior is running on sites which receive thousands of spam attempts daily, and blocks virtually all of them. The few messages which do get through are easy enough to deal with. On most sites, this may be one or two messages a week; on the busiest sites, five to ten a day. Compare that to 200 to 15,000 attempts a day, and you see the difference.

And, it doesn’t bother anybody with digest e-mails, or summaries, or even how many spammers it’s blocked. “I love how it is completely automated. No user involvement needed,” said Mark Jaquith. But some people want to know what it’s doing. This led one blogger to build a statistics plugin for Bad Behavior.

When I started this, I really had no idea what would happen. But it’s become a long-term project. I don’t see any end to blog spam anytime soon, and so I don’t see any end to Bad Behavior anytime soon.

The simple truth of the matter is, there are too many unprotected blogs out there. Technorati reports that only 55% of blogs have a post made in the last three months at any given time, a statistic they say is “consistent throughout the last year.” That means 45% of the 14 million weblogs out there, or about 6.3 million, have been virtually abandoned by their authors.

It’s primarily these blogs that spammers target.

To make a significant dent in this type of spam, blog software needs to ship out-of-the-box with better spam controls. WordPress author Matt Mullenweg attended the second annual web spam summit and now has something cooking. I’ve reviewed Matt’s solution, the Automattic Spam Stopper.

I’m just pissed off that I didn’t hear about the summit until it was over; I could easily have attended.

October 12, 2005 Posted by | Bad Behavior, Blog Spam, WordPress | 2 Comments

Automattic Spam Stopper

Recently, Matt Mullenweg, creator of WordPress, had a bright idea on how to stop blog spam. He wrote up some code, distributed his new WordPress plugin to a small group of testers, and so was born the so-called Automattic Spam Stopper, or ASS.

I was able to obtain a copy of Automattic Spam Stopper for review and made a quite disturbing discovery, namely, how it works.

Whenever a user makes a comment to your WordPress blog, ASS forwards a copy of the entire comment, the metadata such as username, email address and URI, as well as your blog address and Web server environment variables, to a central server for analysis. The server then returns the response “true” if the comment is judged to be spam.
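To show how much leaves your server on every comment, here is a rough sketch of that exchange in Python. The field names and the `env_` prefix are invented for illustration; only the general shape (comment, metadata, blog address and environment variables going out, “true” coming back for spam) comes from what I observed.

```python
from urllib.parse import urlencode


def build_check_payload(blog_url: str, comment: dict, server_env: dict) -> str:
    """Bundle the comment, its metadata and the server environment.

    All field names here are hypothetical, not the plugin's real ones.
    """
    fields = {
        "blog": blog_url,
        "author": comment.get("author", ""),
        "author_email": comment.get("email", ""),
        "author_url": comment.get("url", ""),
        "content": comment.get("content", ""),
    }
    # Every environment variable is forwarded along with the comment.
    for name, value in server_env.items():
        fields[f"env_{name}"] = value
    return urlencode(fields)


def is_spam(response_body: str) -> bool:
    """The server answers "true" when it judges the comment to be spam."""
    return response_body.strip().lower() == "true"
```

Seen this way, the privacy question is plain: the full comment text, the commenter’s e-mail address and your server’s environment all go out in a single request.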

Mullenweg isn’t saying what the “secret sauce” is for the server, so as to frustrate the spammers. “By the time we’re done spammers around the world will quiver in their boots,” said Mullenweg.

So how does the server determine what’s spam? Users of the plugin submit copies of any spam they receive by marking them as spam in the WordPress administration panel. ASS then forwards copies of these to the server for analysis.

The submitted spam, however, remains in your database, but hidden from view. This could cause resource constraint (disk space) problems, and backup/restore problems, for many users, especially over time. WordPress does not automatically remove spam from its database, and does not provide any method for removing it from the database. A third-party plugin, however, does provide this function.

Right now Mullenweg inspects all comments submitted this way manually, before the server considers them to be spam. If he judges them to actually be spam, then they are added to the server’s corpus, or database of submitted spam.

He has not said, however, whether legitimate comments are kept on the server, or whether anyone else looks at the submissions. Thus, ASS may not be a good anti-spam choice for private blogs, or for blogs which frequently use password protection to limit access to their contents. In a very real sense it comes down to whether you trust Matt Mullenweg with your readers’ comments. Some people will, and others won’t.

Mullenweg envisions ASS as a service which is free for personal use, and paid for business use. “I would be more comfortable with something where it was free for regular people, and only businesses or enterprises paid (enough to support everybody),” he said.

“There may be ‘keys’ or accounts at some point to prevent abuse,” he said. “However the plugin and API are designed to be pretty easy to recreate, so if someone wanted to run their own spam [prevention] service they could easily.”

That much is true. I could create a server to do this in rather short order. And I almost did. It’s an idea that’s been discussed before among WordPress anti-spam gurus, and ultimately rejected.

To date no one has been able to provide a centralized server solution which ensures the integrity of the database, for instance. Mullenweg ensures the integrity of his database by inspecting all comments manually, but this “solution” doesn’t scale very well, and is untenable once ASS is released to a wider audience. He has proposed that users be registered and receive keys in order to use the service, but even this doesn’t prevent spammers themselves from registering and submitting garbage to the database.

In addition, no one has been able to provide a centralized server solution which ensures the privacy of users whose comments are subject to this sort of analysis, especially with respect to private blogs and password-protected posts, where users expect their comments to be private. I’ve come up with an idea or two on how this might be done, but I’m not sharing until I’m certain it really can be done; if it were really that easy, it seems that someone would have done it already.

Now if Mullenweg can solve the problems of privacy, integrity, scalability, and those gigabytes of spam clogging up his users’ databases, he may be on to something. But everyone else who’s had this idea ultimately scaled it back or dropped it entirely. I fail to see how Matt’s ASS is any different.

In the meantime, if you’re looking to stop spam without compromising your users’ privacy, consider Bad Behavior, which is shockingly effective despite not looking at the content of comments at all, and Spam Karma, which does, but doesn’t send the whole comment, and much of your server information, off to who knows where.

Update: Some other reviews of Automattic Spam Stopper:

October 10, 2005 Posted by | Bad Behavior, Blog Spam, WordPress, WordPress 1.6, WordPress.com | 13 Comments

Citadel: Groupware secret revealed

Since 1988 or so, the Unix version of Citadel, and since 1981, its DOS and CP/M predecessors, have been the choice for people wanting to start a community site on the Internet or on dialup. From its beginnings as a simple, easy to use bulletin board system, though, Citadel has grown to become much more.

Today Citadel supports most every major communications protocol you can think of, all by itself, without the help of other programs. It does SMTP, POP3 and IMAP for e-mail, for instance, handling multiple domains and virtual domains without blinking. It speaks natively to SpamAssassin without using any outside libraries. It includes LDAP directory support.

And it has its own lightweight, streamlined Web server. You don’t even need Apache. The server handles both normal HTTP and secure (HTTPS) connections easily, generating its own certificates or using those provided by your favorite certificate authority. In addition to the rooms (forums) which make up the core of Citadel, the Web server provides Web mail service, as well as calendaring and scheduling.

Oh, I forgot calendaring and scheduling? I’m sorry. Citadel does that too, using standard iCal/vCal objects. And it speaks GroupDAV.

And most everything is built in. No sendmail, postfix, dovecot, cyrus, etc., to mess with. Citadel takes care of it all for you. (OpenLDAP turns out to be much better than anything we could code up, so that is external.) Install, set it up, and forget it.

Yes, forget it. Citadel also does its own maintenance, with daily scheduled jobs. In the rare event of a crash, Citadel recovers itself if needed, too, minimizing downtime. Citadel restarts and recovers so fast, Web-connected users may not even realize anything happened. It just continues working.

Citadel installations are capable of handling over 1,000 simultaneous users on old hardware, and many more than that on the good stuff. And if you do happen to run into a hardware limit, throw in another server. The two Citadel servers will talk to each other and keep everything in sync. Still not enough? There’s no limit to how many Citadels you can have. You can even create an extranet Citadel to connect your partners and vendors to people and information on your Citadels that you designate.

So why haven’t you heard of Citadel before?

For one thing, the Citadel developers are working hard on the next major release, which will be version 7. This version promises a Python interface, making the server and client entirely scriptable and extensible via simple Python scripts. Say goodbye to Bloated Goats. Building custom applications is about to get easy.

Second, Citadel hasn’t actively sought too much publicity. Sure, there’s been the occasional /. posting, but mostly the developers have been placing their primary focus on delivering top notch software.

Now you’re in on the secret. Sick of sendmail’s security bug of the week? Exchange crashed again and took everyone’s calendar with it? Microsoft Outbreak let another virus into the intranet again? Want your email and calendaring to Just Work? With a nice web interface for the road warrior executive types? It’s time to take a good look at Citadel.

October 4, 2005 Posted by | Citadel | 3 Comments