--------
Upgrades
--------
At this point, the upgrade from 1.0 or 5.x to 1.1 and 1.2 is not well
supported. It seems to be mainly due to the handling of the spam_filters
table, but there could be other things in the way.

See: http://drupal.org/node/1137648
     http://drupal.org/node/1185292

---------
Overview:
---------
The Spam module provides numerous tools to auto-detect and deal with spam
content that is posted to your site, without having to rely on third-party
services.

The Spam module provides a trainable Bayesian filter, detection of content
posted from open email relays, flagging of content with an excessive number of
links, the ability to create custom filters, and more.

Features:
   * Can be used completely independently of any third-party service.
   * Automatically learns and blocks spammer URLs and IPs.
   * Detects repeated postings of identical content, or content containing too
     many links.
   * Can notify the user and/or administrator that content was determined to be
     spam, preventing confusion over why their content doesn't show up.
   * Allows filtered users to provide feedback when their postings are
     incorrectly flagged as spam.
   * Provides comprehensive logging to offer an understanding as to how and why
     content is determined to be or not to be spam.
   * Language-independent: Automatically learns to detect spam in any language
     using Bayesian logic.
   * Supports the creation of custom filters using powerful regular expressions.
   * Written in PHP specifically for Drupal.
   * Highly configurable and extendable (includes hooks for writing custom
     filters).

-------------
Spam filters:
-------------
The spam api module includes several spam filter modules, all of which work
together to try to determine whether a given piece of content is spam.  Each
module reviews the content and returns a score between 1 and 99, where 1 means
there is a 1% chance that the scanned content is spam and 99 means there is a
99% chance that the scanned content is spam.  The spam api module takes a
weighted average of all of these scores and assigns a final overall score for
the content.  Based on this final score, the content may or may not be allowed
to be posted on your website.

To see a list of all enabled spam filter modules, log in as a website
administrator and visit "Administer >> Site configuration >> Spam >> Filters".
On this page, filters are listed according to their weight, with lighter weights
floating to the top.  The filters are run in the order they are listed, but at
this time all filters are always run, so order is not important.  It is
possible to disable individual modules on this page.  Finally, you can also set
a "gain" for each module.


  Gain:
  -----
  The gain can be set to any value from 0 to 250.  The gain is a percentage, so
  a gain of 100 is a 100% gain, and a gain of 250 is a 250% gain.  Each spam
  filter module is assigned a gain.  The spam api module uses this gain to
  weight the spam score returned by that spam filter module.  Thus, if a module
  is given a gain of 0%, this effectively disables the module, as any score it
  returns is ignored.  (It is much more efficient to actually disable the
  module, as there is overhead from running the filters even if the final score
  is ignored.)

  The more confident you are of a given spam filter's score, the higher the
  gain should be.  The less confident you are of a given spam filter's score,
  the lower the gain should be.  The score returned by a filter with a gain of
  250 has two and a half times the effect of a score returned by a filter with
  a gain of 100.
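
  The weighting described above can be sketched as follows.  This is only an
  illustration, not the module's actual code: the function name, the exact
  formula (a gain-weighted mean), and the rounding are assumptions made for
  this example.  Filters that return no score, or that have a gain of 0, are
  left out of the average.

```python
from typing import Optional


def combine_scores(results: list[tuple[Optional[int], int]]) -> Optional[int]:
    """Combine (score, gain) pairs into one overall score.

    Hypothetical helper: score is 1..99, or None when a filter returned no
    score; gain is a percentage from 0 to 250.
    """
    weighted_sum = 0.0
    total_gain = 0
    for score, gain in results:
        if score is None or gain == 0:
            continue  # no score, or a gain of 0%: the filter is ignored
        weighted_sum += score * gain
        total_gain += gain
    if total_gain == 0:
        return None  # no filter contributed a score
    overall = round(weighted_sum / total_gain)
    return max(1, min(99, overall))  # keep the result in the 1..99 range
```

  Under these assumptions, a score of 90 from a filter with a gain of 250 and
  a score of 20 from a filter with a gain of 100 combine to 70.  With equal
  gains the same two scores would combine to 55.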

  When you first train your Bayesian filter, it will inherently be wrong much
  of the time.  Thus, when you first enable the Bayesian filter you should
  set the module's gain to a low value.  After it has been sufficiently trained,
  you can then increase the gain to a higher value.


  Duplicate filter:
  -----------------
  The duplicate filter calculates a hexadecimal "hash" for content as it is
  posted to your website.  If the same exact content is posted again, it will
  generate the same "hash" and be detected as duplicate content.  This module
  can then prevent this duplicate content from being posted, and can
  automatically unpublish the previous duplicate posts.
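
  A minimal sketch of hash-based duplicate detection is shown below.  This
  README does not say which hash function the module uses, so SHA-256 is an
  assumption here, and the class and method names are hypothetical.

```python
import hashlib


class DuplicateTracker:
    """Remembers the hash of every post it has seen (illustrative only)."""

    def __init__(self) -> None:
        self.seen_hashes: set[str] = set()

    @staticmethod
    def content_hash(text: str) -> str:
        # Assumption: SHA-256; the hex digest is the hexadecimal "hash".
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def is_duplicate(self, text: str) -> bool:
        """Return True if exactly the same content was posted before."""
        digest = self.content_hash(text)
        if digest in self.seen_hashes:
            return True
        self.seen_hashes.add(digest)
        return False
```

  Because the hash is computed over the exact content, changing even a single
  character produces a different hash and the post is not seen as a duplicate.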

  The duplicate filter also tracks how many times the same IP address has been
  used to post spam.  If the same IP address posts spam more than a configurable
  number of times, the IP address can be automatically banned from posting any
  further content to your website.
  
  This spam filter module can be configured by visiting "Administer >> Site
  configuration >> Spam >> Filters >> Duplicate".  By default, if identical
  content is posted twice, it is flagged as spam and unpublished.  If the same
  IP address is found to have posted more than three pieces of spam content,
  the IP is blacklisted and prevented from posting any further content.

  IP addresses are blacklisted only as long as the spam exists on your website.
  Once the spam is deleted, the IP is no longer blacklisted.
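
  The blacklisting rule can be pictured with the sketch below.  It is an
  assumption-laden illustration: the class and method names are hypothetical,
  and it simply counts the spam posts that currently exist for each IP address,
  using the default limit of three described above.

```python
from collections import Counter


class IpTracker:
    """Counts existing spam posts per IP address (illustrative only)."""

    def __init__(self, max_spam_posts: int = 3) -> None:
        self.max_spam_posts = max_spam_posts
        self.spam_counts = Counter()

    def record_spam(self, ip: str) -> None:
        self.spam_counts[ip] += 1

    def delete_spam(self, ip: str) -> None:
        # Deleting a spam post lowers the count for that IP address.
        if self.spam_counts[ip] > 0:
            self.spam_counts[ip] -= 1

    def is_blacklisted(self, ip: str) -> bool:
        # "More than three" pieces of spam, so the comparison is strict.
        return self.spam_counts[ip] > self.max_spam_posts
```

  In this sketch an IP address becomes blacklisted on its fourth spam post and
  stops being blacklisted as soon as enough of that spam has been deleted.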


  SURBL filter:
  -------------
  SURBLs are lists of web sites that have appeared in unsolicited messages.
  Unlike most blacklists, SURBLs are _not_ lists of message senders.

  The SURBL filter is integrated with several online SURBL lists, checking
  whether any of the URLs found in new content exist in these lists.  If no
  URLs match, the filter does not return any score and the filter is ignored.
  If one or more URLs match, the filter flags the content as highly probable
  spam.

  There is currently no configuration possible for the SURBL module.


  URL filter:
  -----------
  The URL filter scans all new content for URLs.  It then remembers whether
  each URL was found in spam content or non-spam content.  If a URL is more
  often found in spam content than non-spam content, then the new content is
  flagged as highly probable spam.

  There is currently no configuration possible for the URL filter.
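
  The idea behind the URL filter can be sketched as below.  This is not the
  module's implementation: the URL pattern is deliberately simple, and the
  class and method names are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a simple pattern that is good enough for illustration.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


class UrlMemory:
    """Remembers how often each URL appeared in spam and in non-spam."""

    def __init__(self) -> None:
        # url -> [times seen in spam, times seen in non-spam]
        self.counts = defaultdict(lambda: [0, 0])

    def learn(self, text: str, is_spam: bool) -> None:
        for url in URL_PATTERN.findall(text):
            self.counts[url.lower()][0 if is_spam else 1] += 1

    def looks_spammy(self, text: str) -> bool:
        """True if any URL in the text was seen more often in spam."""
        for url in URL_PATTERN.findall(text):
            spam, ham = self.counts.get(url.lower(), (0, 0))
            if spam > ham:
                return True
        return False
```

  A URL that has never been seen before has equal (zero) counts on both sides,
  so in this sketch it does not cause the content to be flagged.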


  Custom filter:
  --------------
  The custom filter allows you to manually define one or more text strings or
  regular expressions to match against new site content.  If no custom filter
  matches, then the module will not return a score and the filter will be
  ignored.

  To create custom filters, visit "Administer >> Site configuration >> Spam >>
  Filters >> Custom".  To create a new filter, click the 'create custom filter'
  link at the bottom of that page.

  All existing filters are listed on this page.  One or more filters can be
  quickly disabled or deleted through this interface.  Statistics are provided
  as to how frequently each filter matches content, and when the last match
  occurred.  To re-enable or otherwise reconfigure a specific filter, click the
  "edit" link.

  New filters can be a simple text string, or a more complex regular expression.
  For example, your filter may simply be the word 'spam'.  Or, as a regular
  expression, your filter may be '/spam/i'.  For more information on creating
  valid regular expressions, visit this page:
    http://www.php.net/manual/en/ref.pcre.php
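
  To experiment with a pattern before entering it, you can test it outside the
  module.  The sketch below uses Python rather than PHP, so the pattern is
  written without the PCRE delimiters: in '/spam/i' the slashes are delimiters
  and the trailing "i" makes the match case-insensitive.  The second pattern,
  with word boundaries, is a hypothetical variation and not part of the module.

```python
import re

# Equivalent of the PCRE pattern '/spam/i'.
anywhere = re.compile(r"spam", re.IGNORECASE)

# Hypothetical stricter variant, equivalent to '/\bspam\b/i': it matches
# "spam" only as a whole word.
whole_word = re.compile(r"\bspam\b", re.IGNORECASE)


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None
```

  With these patterns, "Cheap SPAM here" matches both, while "spammer" matches
  only the first.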

  Custom filters can scan any combination of the content itself, the referrer
  URL associated with the posted content, and the user agent that was used to
  post the content.

  Matching filters can be used to detect spam content as well as to detect
  non-spam content.  For other filters, you may simply want to note that a
  match means the content probably is or probably is not spam.


  Node age filter:
  ----------------
  The node age filter only affects comments.  It ignores new nodes and users.
  When comments are posted, the node age filter looks at how long ago the
  node was posted to your website.  The older the node, the more likely the
  filter considers the comment to be spam.

  This module can be configured by visiting "Administer >> Site configuration >>
  Spam >> Filters >> Node age".  Here you can define what qualifies as "Old
  content", and what qualifies as "Really old content".  By default, "old
  content" is content that was posted more than 4 weeks ago, and comments
  posted on old content are considered 85% likely to be spam.  "Really old
  content" is content that was posted more than 8 weeks ago, and comments
  posted on really old content are considered 99% likely to be spam.
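
  The default thresholds can be sketched as below.  The function name is
  hypothetical, and returning no score for comments on newer nodes is an
  assumption; this README does not say what the filter returns in that case.

```python
from datetime import datetime, timedelta
from typing import Optional

OLD_AGE = timedelta(weeks=4)         # default for "old content"
REALLY_OLD_AGE = timedelta(weeks=8)  # default for "really old content"


def node_age_score(node_created: datetime,
                   comment_posted: datetime) -> Optional[int]:
    """Score a comment by the age of the node it was posted on."""
    age = comment_posted - node_created
    if age > REALLY_OLD_AGE:
        return 99
    if age > OLD_AGE:
        return 85
    return None  # assumption: no score for comments on newer nodes
```

  For example, a comment posted 30 days after its node was created falls into
  the "old content" range and scores 85 in this sketch.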


  Bayesian filter:
  ----------------
  The Bayesian filter performs simple statistical analysis on content, learning
  from spam and non-spam that it sees to determine the likelihood that new
  content is or is not spam. The filter starts out knowing nothing, and has to
  be trained every time it makes a mistake. This is done by marking spam
  content on your site as spam when you see it. Each word of the spam content
  will be remembered and assigned a probability. The more often a word shows up
  in spam content, the higher the probability that future content with the same
  word is also spam.

  When first enabling the Bayesian filter, it is recommended that you visit
  "Administer >> Site configuration >> Spam >> Filters" and set the Gain for
  this module to a low value.  This is because until the module is trained, it
  will assume that all words have a 40% likelihood of being spam.
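
  The sketch below shows the general idea of a per-word Bayesian filter.  It
  is not the module's algorithm: only the 40% value for unknown words comes
  from this README.  The tokenizer, the clamping of probabilities, the way
  word probabilities are combined, and all names are assumptions made for
  this example.

```python
import math
import re
from collections import Counter
from typing import Optional

UNKNOWN_WORD_PROBABILITY = 0.40  # untrained words, as described above


class TinyBayes:
    """Very small word-based spam scorer (illustrative only)."""

    def __init__(self) -> None:
        self.spam_words = Counter()
        self.ham_words = Counter()

    @staticmethod
    def tokenize(text: str) -> list[str]:
        return re.findall(r"\w+", text.lower())

    def train(self, text: str, is_spam: bool) -> None:
        target = self.spam_words if is_spam else self.ham_words
        target.update(self.tokenize(text))

    def word_probability(self, word: str) -> float:
        spam, ham = self.spam_words[word], self.ham_words[word]
        if spam + ham == 0:
            return UNKNOWN_WORD_PROBABILITY
        # Clamp so that a single word can never be absolutely certain.
        return min(0.99, max(0.01, spam / (spam + ham)))

    def score(self, text: str) -> Optional[int]:
        """Return a 1..99 score, or None if the text has no words."""
        words = set(self.tokenize(text))
        if not words:
            return None
        probs = [self.word_probability(w) for w in words]
        p_spam = math.prod(probs)
        p_ham = math.prod(1 - p for p in probs)
        percent = round(100 * p_spam / (p_spam + p_ham))
        return max(1, min(99, percent))
```

  In this sketch an untrained filter scores a single unknown word at 40, which
  is why a low gain is sensible until the filter has seen enough spam and
  non-spam.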

  As spam is posted to your website, simply click the 'Mark as spam' link to
  start training your Bayesian filter.  You should also regularly visit
  "Administer >> Content management >> Comments" and put a checkmark next to
  new comments that you know are valid and are not spam, then select "Teach
  filters selected comments are not spam" and click the "Update" button.  This
  step is critical to teaching your Bayesian filter what is and what is not
  spam.

  The Bayesian filter is language agnostic.  It does not have any configuration
  options at this time.


---------------
Reviewing Spam:
---------------
All content that has been marked as spam can be reviewed by visiting "Administer
>> Content management >> Spam".  You can optionally choose to filter this
listing by content type and/or IP address.  Controls are provided to easily
mark the content as not spam, or to simply publish or unpublish it.

Comment spam can also be found by visiting "Administer >> Content management >>
Comments >> Spam".  From this page, spam comments can be marked as not-spam or
simply deleted.


---------
Feedback:
---------
The spam filter is a useful collection of tools, but it can certainly make
mistakes, marking valid content as spam.  Users of your website can help you
to better train your filters by providing feedback when their content is
incorrectly blocked by your spam filters.

As an administrator, you should regularly go to "Administer >> Content
management >> Spam >> feedback" to review any feedback provided by your
visitors.  Carefully review the content and their feedback before
deciding whether or not to post the blocked content.  If you publish the
content, your filters will automatically learn that this content should not
have been blocked.  If you do not publish the content, it will be permanently
deleted from your website.


--------
Reports:
--------
The spam module implements its own custom logging facility.  These logs can be
reviewed by visiting "Administer >> Reports >> Spam logs".  Your log level will
determine just how much information is logged about each piece of content that
is scanned with the spam module.  If significant information is being logged,
you may find it useful to click the 'trace' link to trace through all actions
taken by the spam module.  You can also click the 'detail' link to see more
information about each log entry.

At the top of this page, click the "Statistics" link to learn more about
how the spam filter is performing.  At this time only raw data is collected,
but at a future time we plan to provide useful reports showing the effectiveness
of the spam filter modules.

Finally, click the "Blocked IPs" tab to see a list of all IP addresses that
are currently being blocked by the spam filter.  This page will also show how
many times a given IP address has been blocked from posting content, as well
as the last time the IP address was blocked.


--------------
Configuration:
--------------
Initial configuration of this module is documented in INSTALL.txt.

Configuration of the module is done at "Administer >> Site configuration >>
Spam".  On this page, you can tell the module which types of content should
be scanned.  You can also tell the module which actions it should take when
spam is detected.


  Advanced configuration:
  -----------------------
  It is generally recommended that you do not make any changes to the advanced
  configuration options.

  The spam threshold is used to decide what content is spam.  All content is
  assigned a score from 1 to 99.  Any content with a score that is equal to or
  greater than the spam threshold is considered to be spam.  Any content with a
  score that is less than the spam threshold is considered to not be spam.
  Changing the spam threshold can have negative consequences, especially on
  websites that have been operating for a long time with a different spam
  threshold.  Old content that has already been scanned will not be affected
  when you change the spam threshold -- this setting only affects new content.
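
  The comparison against the threshold can be written as the short sketch
  below.  The function name is hypothetical, and the threshold of 80 used in
  the example that follows is an arbitrary value, not the module's default.

```python
def classify(score: int, threshold: int) -> str:
    """Classify a 1..99 score against the spam threshold."""
    if not 1 <= score <= 99:
        raise ValueError("score must be between 1 and 99")
    # Equal to or greater than the threshold counts as spam.
    return "spam" if score >= threshold else "not spam"
```

  With a threshold of 80, a score of 80 is classified as spam and a score of
  79 is not.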

  When trying to learn how the spam filters work, or trying to understand why
  content is incorrectly slipping through the filters or being marked as spam,
  it can be helpful to change the log level.  The debug log level will provide
  you with a huge amount of information about each piece of content that is
  scanned by your filters, but it will also result in a large database load
  from writing all of these logs.

  Many individual spam filters also have their own configuration, which is
  described earlier in this document.


------
Other:
------
TODO: Describe how to add custom CSS tags to your theme (override theme_comment)

  237. ------
  238. Other:
  239. ------
  240. TODO: Describe how to add custom CSS tags to your theme (override theme_comment)