CS351 Spring 2004, Project 1: The Bayesian Spam Filter

MondoSoft, Inc.
January 26, 2004

STATUS   Specification and requirements document
VERSION  1.0
DATE     Jan 26, 2004

0 Changelog

Version 1.0  Initial release. Jan 26, 2004.

1 Summary

In response to rising complaints about Unsolicited Bulk Email (UBE, also known as "spam"), MondoSoft Inc.[1] has decided to market a spam filter program, SpamBGon (TM), based on Bayesian statistical decision-making technology. To ensure interoperability of this filter program with other software, it must be capable of interfacing with standard UNIX mail program tools such as procmail(1). Furthermore, MondoSoft has recognized the business opportunity to remarket a sub-component of this program as a high-performance standalone software module. Therefore, this project also includes MondoHashTable, a fully functional hashtable implementation of the java.util.Map interface. To improve marketability and demonstrate these programs' superior performance, both components (the hashtable and the full SpamBGon suite) will also be accompanied by rigorous scientific empirical performance evaluations.[2]

[1] Your employer.
[2] Unfortunately, because MondoSoft's VCs pulled out during the bubble burst, MondoSoft is a bit strapped for cash to cover independent laboratory certification, so you get stuck with this job too.

2 Definitions

ANALYZABLE  The sections of an email message subject to tokenization and statistical analysis. The ANALYZABLE section consists of the BODY, plus the FIELD-BODY of the HEADERS "From", "To", and "Subject".

BODY  The main body of an email message; the part of an email message outside the HEADERs. Refer to RFC822 for details.

CLASSIFICATION  The process of using statistics accumulated during TRAINING to label an UNLABELED message as SPAM or NORMAL.

EVIL  See SPAM.

FEATURE  Some measurable characteristic of an email message, such as the presence or absence of a specific word, the length of a line, etc.

FIELD-BODY  The contents of a HEADER, following the FIELD-NAME. Refer to RFC822 for details.

FIELD-NAME  The sequence of characters that identifies a field (i.e., a specific HEADER) within the HEADERS section of an email message.

HEADER  An email header or field, specifying information such as routing, date and time, subject, etc. Refer to RFC822 for details.

MAILBOX  A folder format for storing multiple email messages that is widely used under Unix (e.g., by mail, mailx, pine, mutt, etc.). A MAILBOX consists of zero or more email messages concatenated together, separated by (single) blank lines. Each new email message is recognized by the presence of the token "From " at the beginning of a line. No other structure or control information is imposed on the file.

MAY  A requirement that the product can choose to implement if desired. Can also indicate a choice among acceptable alternatives (e.g., "The program MAY do x, y, or z." indicates that the choice of behavior x, y, or z is up to the designer).

MUST  A requirement that the product must implement for full credit.

MUST NOT  A behavior or assumption that must not be violated. Violating a MUST NOT restriction will result in a penalty on the assignment.

NORMAL  Email that the USER wishes to receive.

PRIOR  Or prior frequency estimate. The expected frequency of SPAM or NORMAL emails after TRAINING but before looking at a specific email message. Represents the proportion of SPAM and NORMAL email messages in the training data. Equivalent to the terms Pr[C_S] and Pr[C_N].
POSTERIOR  Or posterior frequency estimate. The conditional probability estimate of a specific message being SPAM or NORMAL after analyzing the contents of the message. Equivalent to the terms Pr[C_S|X] and Pr[C_N|X].

PUNCTUATION  Punctuation characters. For the purposes of this project, the punctuation characters are considered to be any characters other than WHITESPACE, letters, or digits.

RECOVERABLE ERROR  An error condition that the software can ignore, correct, or otherwise recover from. The program MUST produce a warning message and then cleanly continue with no corruption or loss of valid data.

RFC822  The document that specifies the syntax of standard internet email messages. Refer to this document for all specifications related to the format of email messages. Available at http://www.ietf.org/rfc/rfc0822.txt.

SPAM  Email that the USER does not wish to receive.

TRAINING  Mode or stage in which the software compiles statistics from data that is known to be SPAM or NORMAL.

UNLABELED  An email message whose content (SPAM or NORMAL) is unknown.

UNRECOVERABLE ERROR  An error condition from which recovery is impossible. The program MUST produce an error message describing the condition and then cleanly halt.

USER  A single computer user or email recipient (potentially a mailing list).

WHITESPACE  Non-printable characters including (but not limited to) space, horizontal and vertical tabs, newlines, and carriage returns. Cf. the Java JDK API call Character.isWhitespace().

3 Requirements

This section describes the elements that MUST be developed as part of this project. The designer MAY also choose to implement additional Java source files, programs, and/or shell scripts in support of the following items. This section only describes the general performance requirements for each element; for specific deliverable requirements, please refer to Section 5.

3.1 Hashtable

The fundamental unit of statistical analysis for the Bayesian spam filter is the statistic for an individual token, which boils down to counts of the number of occurrences of each distinct token seen in the TRAINING data (see Appendix A for details). To track the mapping between tokens and their counts, the SpamBGon suite will use a hash mapping, specifically, a from-scratch implementation of the java.util.Map interface, as documented in the Java 1.4.1 API specification. This module will be named MondoHashTable.java and MUST support the complete java.util.Map interface and contract specification. The MondoHashTable implementation MUST NOT use, access, refer to, or rely on the AbstractMap or any other implementation of the Map interface. The MondoHashTable implementation MAY employ the java.util.AbstractSet implementation to support the Map.keySet() and/or Map.values() operations.

As part of the project deliverables, the developer MUST demonstrate the performance of the MondoHashTable and show that it meets the quantitative requirements given in Section 4. To do so, it will probably be necessary to provide additional data members, methods, or subclasses to track quantities such as the number of allocations and reallocations, the number of accesses, wall clock time, etc. The choice of which data/methods/subclasses to provide is up to the developer, but all such entities MUST be documented in the API documentation (cf. Section 5.1). The MondoHashTable will form the core of the first milestone submission; refer to Section 5.1 for details on the full submission requirements.
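For concreteness, the sketch below shows one possible internal organization for such a module: a separately chained table with a pair of instrumentation counters of the kind suggested above. It is a sketch only, not a complete java.util.Map implementation (remove(), the view methods, null-key handling, and the rest of the Map contract are omitted), and every name in it other than the java.util.Map methods it mirrors is the sketch's own choice.

/*
 * Illustrative sketch only: the core mechanics of a separately chained hash
 * table plus instrumentation counters of the kind the performance
 * documentation (Section 5.1) might report.  This is NOT the required
 * MondoHashTable: it does not implement the full java.util.Map contract,
 * and it does not handle null keys.
 */
public class MondoHashTableSketch {

    private static final class Entry {
        final Object key;
        Object value;
        Entry next;
        Entry(Object key, Object value, Entry next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    private Entry[] buckets = new Entry[16];
    private int size = 0;
    private final float maxLoadFactor = 0.75f;

    // Instrumentation counters (see Sections 4 and 5.1).
    private long accessCount = 0;   // calls to get()/put()
    private long reallocCount = 0;  // number of table resizes

    private int indexFor(Object key) {
        // Spread the hash code over the current table size (non-null keys only).
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public Object get(Object key) {
        accessCount++;
        for (Entry e = buckets[indexFor(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public Object put(Object key, Object value) {
        accessCount++;
        int i = indexFor(key);
        for (Entry e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                Object old = e.value;
                e.value = value;
                return old;          // Map.put() returns the previous value
            }
        }
        buckets[i] = new Entry(key, value, buckets[i]);
        size++;
        if (size > maxLoadFactor * buckets.length) {
            resize(2 * buckets.length);
        }
        return null;
    }

    private void resize(int newCapacity) {
        reallocCount++;
        Entry[] old = buckets;
        buckets = new Entry[newCapacity];
        for (int i = 0; i < old.length; i++) {
            for (Entry e = old[i]; e != null; ) {
                Entry next = e.next;
                int j = indexFor(e.key);   // index in the new, larger table
                e.next = buckets[j];
                buckets[j] = e;
                e = next;
            }
        }
    }

    public int size()             { return size; }
    public boolean isEmpty()      { return size == 0; }
    public long getAccessCount()  { return accessCount; }
    public long getReallocCount() { return reallocCount; }
}

Whatever internal structure is actually chosen, instrumentation hooks like accessCount and reallocCount above are the sort of additional data members that would feed the performance documentation.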
3.2 Tokenizers

The job of a tokenizer is to split the ANALYZABLE sections of an email message into small chunks, called tokens, which are the basic units of analysis. Many different types of tokens are possible: individual words, words plus punctuation characters, short strings of characters, HTML entities, MIME attachments, etc. A program can obtain different streams of tokens and, thus, different statistics from the same email message by changing the definition of a token (i.e., by changing the tokenizer module that turns an email message into tokens). The SpamBGon software suite MUST provide at least three tokenizers:

NGramTokenizer  This tokenizer splits the ANALYZABLE section of an email message into tokens of n contiguous characters, where n is a parameter to the tokenizer. This tokenizer MUST be able to omit WHITESPACE and PUNCTUATION characters. It MAY also provide user-selectable functionality to include WHITESPACE and/or PUNCTUATION characters.

WhiteSpaceTokenizer  This tokenizer splits the ANALYZABLE section of an email message at WHITESPACE characters. Essentially, the goal of this tokenizer is to split out individual "words". This tokenizer MUST discard WHITESPACE and PUNCTUATION characters, though it MAY also provide a user-selectable option to preserve PUNCTUATION characters.

One other tokenizer  The choice of this tokenizer is a design decision, but it MUST be documented, described, motivated (i.e., a rationale given for why it might be a useful tokenizer), and empirically tested. Possibilities for this tokenizer include (but are not limited to) a recognizer for dates and times (from the Date header), a tokenizer that recognizes HTML tags and their contents, a tokenizer that recognizes MIME messages as single entities, a tokenizer based on n-grams of words (rather than characters), etc.

These tokenizers MUST be interchangeable; the USER MUST be able to select among the tokenizers at the command line during TRAINING and CLASSIFICATION. When a tokenizer requires additional parameters (e.g., the parameter n for the NGramTokenizer), the program MUST provide a command-line interface to set such parameters.
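The specification does not prescribe a tokenizer API, so the shape of the interface is up to the designer. The sketch below is one minimal possibility, shown here only to make the "interchangeable tokenizers" idea concrete: the interface name, method signature, and stripped-down whitespace tokenizer are the sketch's own choices, not requirements.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/*
 * Illustrative sketch only: a hypothetical tokenizer interface that BSFTrain
 * and BSFTest could be written against, so that the concrete tokenizer can be
 * chosen at run time from the -k option.
 */
interface Tokenizer {
    /** Returns the next token, or null when the ANALYZABLE text is exhausted. */
    String nextToken() throws IOException;
}

/** Stripped-down illustration of splitting at WHITESPACE and discarding PUNCTUATION. */
class WhiteSpaceTokenizerSketch implements Tokenizer {
    private final Reader in;

    WhiteSpaceTokenizerSketch(Reader in) { this.in = in; }

    public String nextToken() throws IOException {
        StringBuffer word = new StringBuffer();
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (Character.isWhitespace(ch)) {
                if (word.length() > 0) {
                    return word.toString();   // end of the current word
                }
                // otherwise skip leading whitespace
            } else if (Character.isLetterOrDigit(ch)) {
                word.append(ch);
            }
            // any other character is PUNCTUATION and is discarded
        }
        return (word.length() > 0) ? word.toString() : null;
    }

    // Tiny demonstration: prints "Cheap", "pills", "NOW" on separate lines.
    public static void main(String[] args) throws IOException {
        Tokenizer t = new WhiteSpaceTokenizerSketch(new StringReader("Cheap pills, NOW!!!"));
        for (String tok = t.nextToken(); tok != null; tok = t.nextToken()) {
            System.out.println(tok);
        }
    }
}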
3.3 Spam Filter

The Bayesian Spam Filter program suite MUST provide at least two client programs, BSFTrain and BSFTest. The suite MAY also provide additional programs for additional functionality, at the designer's option. Each program's interface is described below, followed by common options that MUST be supported by both programs.

BSFTrain

This client is responsible for analyzing known examples of NORMAL and SPAM emails, building the naïve Bayes statistical models for each, and saving a durable copy of those models. This program also provides an interface to display a summary of the contents of the two naïve Bayes models.

This program MUST accept email inputs in the format specified by RFC822 as training input. It MUST also accept MAILBOX format files as training input, and recognize the individual messages that occur within the file as separate email entities. It MUST correctly handle zero-length input files. This program MAY accept its training data files from standard input, or it MAY provide direct file access to them (via an appropriately chosen command-line option). It MUST support some form of durable storage for the statistical models. The developer MAY use the Java serialization mechanism to implement this durable storage. BSFTrain MAY save its statistics in a single, combined file or in two separate files.

This program MUST be capable of loading previously produced statistics files, updating them with new data, and saving the new updated files. This program MUST NOT lose data from previous TRAINING sessions during this operation; it MUST only add new data to the existing statistics. BSFTrain MUST maintain two statistical models: one for SPAM and one for NORMAL email. When this program is initially run (i.e., if no statistics files exist at first), it MUST construct new, default statistics tables. The designer MAY choose to require a separate, "initialization" invocation to create and initialize the statistics files before TRAINING, or MAY choose to have that operation happen transparently.

This program MUST be capable of producing a human-readable dump of the current (possibly default) statistical models. The designer MAY choose any reasonable format for the dump, but the dump MUST, at the minimum, provide information about the relative frequency of all known tokens under both SPAM and NORMAL classes, the class PRIORs, and the total number of tokens read.

The BSFTrain client MUST support the following command-line options:

-s  Treat input data as SPAM.

-n  Treat input data as NORMAL.

-t  Run in TRAINING mode. Compiles input data into statistics and updates the existing statistics tables (if any) with the new data.

-d  Run in dump statistics mode. Produce a summary report that gives the statistics for both SPAM and NORMAL email. This report MAY be printed to standard output or MAY be written to a log file. If BSFTrain supports writing the dump to a log file, it MUST provide a command-line option for choosing that output file.

This program MUST also support the options listed below under Common Options. BSFTrain MAY implement additional command-line options of the designer's choice, but such options MUST NOT conflict with the options given above or the options listed below under Common Options.

BSFTest

This client is responsible for analyzing a single, UNLABELED email message and labeling it as SPAM or NORMAL. This program MUST read its email message from standard input and write its (normal) output to standard output. Its assessment of the label for the email (either SPAM or NORMAL) MUST be written as an "X-Spam-Status:" HEADER, including the token SPAM or NORMAL and the likelihoods for each class. For example,

    X-Spam-Status: NORMAL, ll(SPAM)=-2478.34, ll(NORMAL)=-1893.28

or

    X-Spam-Status: SPAM, ll(SPAM)=-3789.6, ll(NORMAL)=-4279.62

BSFTest MAY also emit additional "X-" headers of the designer's choice. All such headers MUST be fully documented in the user documentation. This program MUST accept the options listed below under Common Options. It MAY also accept other options that do not conflict with those, at the designer's option.

If BSFTest is invoked with no previously trained model (i.e., no statistics file(s) generated by BSFTrain), it MAY treat this as a RECOVERABLE or UNRECOVERABLE error, or it MAY silently initialize the appropriate statistics internally. If it chooses to initialize the statistics, it MAY write them out to file after the analysis or it MAY discard them.
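As an illustration of the input/output contract above, the sketch below copies a message from standard input to standard output and emits an X-Spam-Status HEADER in the required format. It is a sketch only: the placement of the new header (here, in front of the original message) is a design choice, and the hard-coded log-likelihood values simply reuse the second example above; the real BSFTest would compute them from its trained models as described in Appendix A.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/*
 * Illustrative sketch only: BSFTest's stdin-to-stdout plumbing with the
 * required X-Spam-Status HEADER added.  The log-likelihoods are placeholders
 * for values produced by the statistical machinery of Appendix A.
 */
public class SpamStatusHeaderSketch {

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        StringBuffer message = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            message.append(line).append('\n');   // buffer the whole message
        }

        // Placeholder log-likelihoods; the real values come from the trained models.
        double llSpam = -3789.6;
        double llNormal = -4279.62;
        String label = (llNormal > llSpam) ? "NORMAL" : "SPAM";

        System.out.println("X-Spam-Status: " + label
                + ", ll(SPAM)=" + llSpam
                + ", ll(NORMAL)=" + llNormal);
        System.out.print(message);   // then the original message, unchanged
    }
}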
Common Options

Both client programs MUST support the following command-line options:

-m modelFileName  Load/save SPAM/NORMAL statistical models from/to the specified filename. If BSFTrain saves all statistical models to a single file, the filename should be modelFileName.stat; if it uses separate files for SPAM and NORMAL statistics, their respective file names should be modelFileName.sstat and modelFileName.nstat.

-k tokenizerName  This selects the type of tokenizer to be used in the current analysis. The program MAY also require additional command-line arguments specific to each tokenizer (e.g., the NGramTokenizer requires that the parameter n be set). Failure to provide a necessary tokenizer-specific option MAY be treated as a RECOVERABLE or UNRECOVERABLE ERROR. If a program detects a discrepancy between the type of tokenizer specified by -k and the type of tokenizer previously used to construct the saved statistics tables (if any), it MAY choose to interpret this as an UNRECOVERABLE or (if possible) a RECOVERABLE ERROR. It MUST NOT crash or produce incorrect results because of this condition. It MUST NOT destroy, corrupt, or overwrite previously existing saved statistics file(s).
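The sketch below shows one way the common options, plus BSFTrain's own flags, might be parsed using the gnu.getopt.Getopt class that Section 4 (below) explicitly permits. Using gnu.getopt at all is optional, and the option string and variable names are the sketch's own choices.

import gnu.getopt.Getopt;

/*
 * Illustrative sketch only: parsing BSFTrain's options with the optional
 * gnu.getopt.Getopt class.  BSFTest would use the same idea with its smaller
 * option set.  How to treat an unrecognized or missing option (RECOVERABLE or
 * UNRECOVERABLE ERROR) is left to the designer.
 */
public class OptionParsingSketch {

    public static void main(String[] args) {
        boolean spam = false, normal = false, train = false, dump = false;
        String modelFileName = null;
        String tokenizerName = null;

        // "m:" and "k:" take an argument; the single letters are flags.
        Getopt g = new Getopt("BSFTrain", args, "sntdm:k:");
        int c;
        while ((c = g.getopt()) != -1) {
            switch (c) {
                case 's': spam = true; break;              // input is SPAM
                case 'n': normal = true; break;            // input is NORMAL
                case 't': train = true; break;             // TRAINING mode
                case 'd': dump = true; break;              // dump statistics
                case 'm': modelFileName = g.getOptarg(); break;
                case 'k': tokenizerName = g.getOptarg(); break;
                default:
                    System.err.println("warning: unrecognized option");
            }
        }
        System.out.println("train=" + train + " spam=" + spam
                + " normal=" + normal + " dump=" + dump
                + " model=" + modelFileName + " tokenizer=" + tokenizerName);
    }
}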
4 Quantitative Requirements

This section describes the performance and IP requirements for the SpamBGon software suite.

• All programs MUST NOT crash, core dump, dump a stack trace, or throw an exception on any input.

• In the case of a RECOVERABLE ERROR, a program MUST issue a warning statement and continue processing. The program MAY choose to issue the warning statement to standard error or to a log file. If the warning is issued to a log file, the log file name and location MUST be a user-specifiable parameter to the program.

• In the case of an UNRECOVERABLE ERROR, a program MUST issue an error statement and terminate with a non-zero error condition. The program MAY use different exit codes to indicate different error conditions, but such codes MUST be documented in the user manual. The error message MUST be logged to the same destination as warning messages (from RECOVERABLE ERRORs).

• In the case of any ERROR, a program MUST NOT delete, corrupt, or damage existing statistics model files or any other "stateful" files employed by the program suite.

• The MondoHashTable.java module MUST NOT use or reference the Hashtable, HashMap, AbstractMap, HashSet, TreeSet, or any of their subclasses.

• For (substantially) reduced credit, BSFTrain and BSFTest MAY use the HashMap class in place of MondoHashTable. Note that this allowance exists only as an aid in case the programmer has difficulty getting MondoHashTable to work properly; for full credit the entire SpamBGon suite MUST employ MondoHashTable and MUST NOT employ or refer to any of the classes listed in the previous bullet point.

• The entire program suite MUST NOT employ or refer to the StreamTokenizer class.

• The programs MAY provide additional output for debugging purposes, but such output MUST be disabled by default. Any program MAY provide a command-line switch to enable debugging support when desired.

• The SpamBGon suite MAY use the gnu.getopt.Getopt and gnu.getopt.LongOpt classes to assist in handling command-line options.

• The programmer MAY ask permission of the instructor or the TA to use any classes outside the JDK that have not already been mentioned. The final programs MUST NOT use any class outside the JDK that has not been explicitly allowed.

• The SpamBGon suite MAY assume that all valid input is standard ASCII text in the range (char)0–(char)127, inclusive. If a program encounters a character outside this range, it MAY treat it as a RECOVERABLE or UNRECOVERABLE ERROR or silently ignore it. If such characters are treated as RECOVERABLE or ignored, they MUST NOT disrupt the otherwise normal functioning of the program.

• All programs MUST NOT assume that all input is validly structured email. If a program encounters non-email input (e.g., missing or corrupted HEADERS, invalid character sets, improper MIME boundaries, etc.), it MAY produce a RECOVERABLE or UNRECOVERABLE ERROR, but it MUST NOT crash, corrupt the statistics files, etc. If a program chooses to RECOVER from an ill-formed email, it MUST NOT corrupt the statistics tables with information from the illegal input; it MUST wait for the next valid input before continuing to update the statistics tables.

• Both the BSFTrain and BSFTest programs MUST run in amortized O(n) time for email input of size n.

• The MondoHashTable MUST support get(), put(), remove(), size(), and isEmpty() in amortized O(1) time. The table MAY support key/value iteration in time proportional to the capacity of the table. For extra credit, it MAY support key/value iteration in time proportional to the number of keys/values (respectively). To receive the extra credit, the designer must demonstrate this convincingly in the performance documentation.

• The MondoHashTable MUST NOT consume more than O(n · s) memory for n distinct keys, where s represents the combined size of a key/value pair.

• The MondoHashTable MUST support the keySet() and values() operations with only O(1) space above that required by the hashtable itself. Specifically, these operations MUST NOT replicate the underlying hashtable, nor duplicate any keys or values.

• All user documentation MUST be grammatically correct and include correct spelling and usage. Notably, "Bayes" was a real person, so all terminology including his name must be capitalized, e.g., "Bayesian spam analysis", "naïve Bayes", etc.

• The programmer MUST document any areas in which her or his software suite does not meet this specification. WARNING! The grade penalty will be higher if the instructors discover an undocumented program shortcoming or bug than if it is documented up front.
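As one example of the kind of empirical evidence the performance documentation could contain, the crude timing harness below measures the average cost of put() and get() at doubling table sizes; roughly flat per-operation figures as n grows are consistent with (though not a proof of) the amortized O(1) requirement. The class and its details are the sketch's own; HashMap appears only so the file compiles on its own, and the measurements of interest would of course be taken with MondoHashTable substituted in.

import java.util.HashMap;
import java.util.Map;

/*
 * Illustrative sketch only: a crude timing harness for the hashtable.
 * Replace "new HashMap()" with "new MondoHashTable()" to measure the real
 * module; HashMap is used here only so the sketch compiles standalone.
 */
public class HashTimingSketch {

    public static void main(String[] args) {
        for (int n = 1 << 14; n <= 1 << 20; n <<= 1) {
            Map table = new HashMap();   // substitute the MondoHashTable here

            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                table.put("token" + i, new Integer(i));
            }
            long putMillis = System.currentTimeMillis() - start;

            start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                table.get("token" + i);
            }
            long getMillis = System.currentTimeMillis() - start;

            // Report the average cost per operation in microseconds.
            System.out.println(n + " keys: "
                    + (1000.0 * putMillis / n) + " us/put, "
                    + (1000.0 * getMillis / n) + " us/get");
        }
    }
}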
5 Deliverables

This section describes the content to be delivered at each stage of the project (one milestone and a final rollout). For the deadlines of these stages, please refer to Section 6.

5.1 Milestone 1: MondoHashTable

The first project component due is the MondoHashTable implementation. The deliverables for this milestone are:

MondoHashTable.java  The main class file for the MondoHashTable implementation.

Other Java source files  Any other supporting code files necessary to compile, load, and use the MondoHashTable module.

API documentation  The handin MUST also include the full, compiled JavaDoc documentation for the MondoHashTable implementation. This documentation MUST include full descriptions of every public or protected method, field, sub-class, enclosed class, or constructor employed by MondoHashTable. This documentation hierarchy MUST be included in a sub-directory named documentation/ within the submission tarball package.

Performance documentation  The handin submission MUST include a document describing the performance of the hashtable implementation and demonstrating (via empirical experiments) that it meets the quantitative performance goals established in Section 4 of this document. The designer MAY choose any tests that he or she desires to establish the performance of his/her MondoHashTable, but MUST describe all tests and why they lead to the stated conclusions about performance. This document MUST be named PERFORMANCE.extension, but it MAY be a plain text, HTML, PDF, or PostScript document (with the appropriate extension). It MUST NOT be a Microsoft Word or other nonportable format document.

Test cases  The submission tarball MUST include a subdirectory named tests/ that includes all of the test data used to demonstrate the performance of the hashtable implementation.

CVS log file(s)  For each Java source file, the submission tarball MUST include a corresponding .log file including the CVS log for that source code. E.g., for the file MondoHashTable.java, the following CVS command will produce the appropriate log output:

    cvs log MondoHashTable.java > MondoHashTable.log

At the programmer's option, this submission MAY also include:

BUGS.TXT  This file documents any known outstanding bugs, missing features, performance problems, or failures to meet specifications of your submission. Note that the penalty for such problems will be smaller if they're fully documented here than if the instructors discover them independently.

The submission directory MUST be named lastname_p1m1 and the submission tarball MUST be named lastname_p1m1.tar.gz.

5.2 Rollout: The SpamBGon Suite

The second and final stage of the project is the rollout of the completed project. The deliverables for this stage are:

BSFTrain.java and BSFTest.java  The two primary programs of the SpamBGon suite.

Other Java source files  Any other supporting code files necessary to compile, load, and use the BSFTrain and BSFTest programs. Note: if these programs depend on external library code other than the Java JDK or the gnu.getopt suite, the submission tarball MUST either include the library whole or provide easy and explicit instructions on how and where to access such libraries. This documentation MUST be provided in the README.TXT file. The designer is responsible for ensuring that all copyright and distribution conditions are adhered to.

README.TXT  This file MUST describe how to compile, configure, and install the SpamBGon suite. It MUST also list any dependencies on additional software support libraries.

Internal documentation  The handin MUST also include the full, compiled JavaDoc documentation for all Java source files in the submission tarball. This documentation MUST include full descriptions of every public or protected method, field, sub-class, enclosed class, or constructor employed by the code. This documentation hierarchy MUST be included in a sub-directory named documentation/ within the submission tarball package.

User documentation  The handin submission MUST include complete user-level documentation for the SpamBGon suite. This documentation MUST include instructions on how to use both BSFTrain and BSFTest, including the functionality of all command-line options. The documentation MUST also describe the function and use of any additional programs included in the submission. User documentation MUST include information on the expected inputs and outputs of all programs, how to read and interpret the output, and information on all status and error messages that the programs could produce. This documentation MUST also include at least one example of how to run each program and how to interpret the output. This document MUST be named USERDOC.extension, but it MAY be a plain text, HTML, PDF, or PostScript document (with the appropriate extension). It MUST NOT be a Microsoft Word or other nonportable format document.
Performance documentation  The handin submission MUST include a document describing the performance of the SpamBGon suite, including its ability to differentiate SPAM from NORMAL email under different amounts of TRAINING data and under different tokenizers (including a small range of reasonable parameters for each parameterized tokenizer). This document MUST also include the designer's assessment of which tokenizer is superior and why or, if different tokenizers are superior under different conditions, what conditions are important to the success of each. The designer MAY choose any tests that she or he desires to establish the performance of her/his SpamBGon suite, but MUST describe all tests and why they lead to the stated conclusions about performance. Finally, this document MUST include the designer's assessment of how to improve the performance of the system (e.g., what other kind of tokenizer might be helpful, how to change the probability equations to improve accuracy, etc.). This document MUST be named PERFORMANCE.extension, but it MAY be a plain text, HTML, PDF, or PostScript document (with the appropriate extension). It MUST NOT be a Microsoft Word or other nonportable format document.

Test cases  The submission tarball MUST include a subdirectory named tests/ that includes all of the test data used to demonstrate the performance of the SpamBGon suite.

CVS log file(s)  For each Java source file, the submission tarball MUST include a corresponding .log file including the CVS log for that source code.

At the programmer's option, this submission MAY also include:

BUGS.TXT  This file documents any known outstanding bugs, missing features, performance problems, or failures to meet specifications of your submission. Note that the penalty for such problems will be smaller if they're fully documented here than if the instructors discover them independently.

Note that if the MondoHashTable code is not fully functional for Milestone 1, a revised version MAY be submitted in this handin. If MondoHashTable has been revised for this version, this submission tarball MUST include the necessary supporting documentation described under Milestone 1, as well as notes describing the added functionality/improvements between Milestone 1 and this handin.

6 Timeline

Jan 26           Project specification handed out.
Feb 6, 5:00 PM   MondoHashTable component due.
Feb 20, 5:00 PM  Full project due.

Appendix A: Bayesian Spam Analysis

The spam classification method you will be using is based on a Bayesian statistical model known as the "naïve Bayes" model. It's based on estimating the probabilities that a given UNLABELED message is either SPAM or NORMAL. More specifically, for an UNLABELED message, X, you must evaluate the quantities Pr[C_N|X] and Pr[C_S|X], where C_N denotes the class of NORMAL email messages and C_S is the class of SPAM email messages. If you find that

    Pr[C_N|X] > Pr[C_S|X]                                                 (1)

you can label the message X NORMAL; otherwise you label it SPAM.

The trick is finding Pr[C_i|X]. Your program (specifically, BSFTrain) will estimate them by looking at a great many NORMAL and SPAM emails, called TRAINING DATA. The problem is that, even given a bunch of example emails, it's not immediately obvious what this conditional probability might be.
Through the magic of Bayes' rule, however, we can turn this around:

    Pr[C_i|X] = Pr[X|C_i] Pr[C_i] / Pr[X]                                 (2)

The quantities Pr[C_N|X] and Pr[C_S|X] are called posterior probability estimates; posterior because they're the probabilities you assign after you see the data (i.e., after you get to look at X). The quantity Pr[C_i] is called the prior probability of class i, or simply the prior. This is the probability you would assign to a particular message being SPAM or NORMAL before you look at the contents of the message. The quantity Pr[X] is the "raw" data likelihood. Essentially, it's the probability of a particular message occurring, across both SPAM and NON-SPAM. The quantity Pr[X|C_i] is called the generative model for X given class i.

It seems like we've taken a step backward. We now have three quantities to calculate rather than one. Fortunately, in this case, these three are simpler than the original one. First off, if all we want to do is classify the data, via Equation 1, then we can discard the raw data likelihood, Pr[X] (to see this, plug Equation 2 into 1). Second, the term Pr[C_i] is easy; it's just the relative probability of a message being SPAM or NORMAL, i.e., the frequency of SPAM or NORMAL emails you've seen:

    Pr[C_N] = (# NORMAL emails) / (total # emails)
            = (# NORMAL) / (# SPAM + # NORMAL)

    Pr[C_S] = (# SPAM emails) / (total # emails)
            = (# SPAM) / (# SPAM + # NORMAL)

That leaves the generative model. Note that if I hand you a message and say "that's spam", you have a sample of Pr[X|C_S]. Your job is to assemble a bunch of such samples to create a comprehensive model of the data for each class. Essentially, Pr[X|C_i] tells you what the chance is that you see a particular configuration of letters and words within the universe of all messages of class C_i.

Here we make a massive approximation. First, let's break up the message X into a set of lower-level elements, called features: X = ⟨x_1, x_2, ..., x_k⟩. In the case of email, a feature might be a single character, a word, an HTML token, a MIME attachment, the length of a line, the time of day the mail was sent, etc. For the moment, we won't worry about what a feature is (it's the tokenizer's job to determine that; see Section 3.2 for details); all we'll care about is that you have some way to break the message down into more fundamental pieces. Now we'll write:

    Pr[X|C_i] = Pr[x_1, x_2, ..., x_k | C_i]                              (3)
              ≈ Pr[x_1|C_i] Pr[x_2|C_i] ··· Pr[x_k|C_i]                   (4)
              = ∏_{j=1}^{k} Pr[x_j|C_i]                                   (5)

This is called the naïve Bayes approximation. It's naïve because it is a drastic approximation (for example, it discards any information about the order among words), but it turns out to work surprisingly well in practice in a number of cases. You can think about more sophisticated ways to approximate Pr[X|C_i] if you like (I welcome your thoughts on the matter), but for this project it's sufficient to stick with naïve Bayes.

OK, so now we've blown out a single term that we didn't know how to calculate into a long product of terms. Is our life any better? Yes! Because each of those individual terms, Pr[x_j|C_i], is simply an observed frequency within the TRAINING data for the token x_j; you can get it simply by counting:

    Pr[x_j|C_i] = (# of tokens of type x_j seen in class C_i) / (total # of tokens seen in class C_i)

For example, suppose that your tokens are individual words. When you're analyzing a new SPAM message during TRAINING, you find that the j-th token is the word "tyromancy".
Your probability estimate for "tyromancy" is just:

    Pr["tyromancy"|C_S] = (# "tyromancy" instances in all SPAM) / (total # of tokens in all SPAM)

So when you're TRAINING, every time you see a particular token, you increment the count of that token (and the count of all tokens) for that class. When you're doing CLASSIFICATION, you don't change the counts when you see a token. Instead, you just look up the appropriate counts and call that the probability of the token that you're looking at. So to calculate Equation 5, you simply iterate across the message, taking each token and multiplying its class-conditional probability into your total probability estimate for the corresponding class. In pseudo-code:

    Given: an email message, X, and a label C_i in {C_N, C_S},
    1. break X into its tokens, ⟨x_1, ..., x_k⟩
    2. for each token, x_j
       (a) increment the counter for token x_j for class C_i
       (b) increment the count of total tokens in class C_i
    3. increment the total number of email messages for class C_i

    Figure 1: TRAINING pseudo-code

    Given: an UNLABELED email message, X
    1. p_N := Pr[C_N]
    2. p_S := Pr[C_S]
    3. break X into its tokens, ⟨x_1, ..., x_k⟩
    4. for each token, x_j
       (a) p_N := p_N · Pr[x_j|C_N]
       (b) p_S := p_S · Pr[x_j|C_S]
    5. if p_N > p_S then return NORMAL
    6. else return SPAM

    Figure 2: CLASSIFICATION pseudo-code

And you're done. There are, of course, an immense number of technical issues in turning this into a real program, but that's the gist of it.

One practical issue, however, is underflow: if the product in Equation 5 has very many terms, your probability estimates (p_N and p_S) will quickly become 0 and it will be impossible to tell the difference between the two classes. To overcome this, instead of working directly with the probabilities of tokens, we'll work with the log likelihood of the tokens. I.e., we'll replace Equation 5 with log(Equation 5). (Question 1: how does this change the algorithm in Figure 2? Question 2: does this leave the final classification unchanged? Why or why not?)

A second critical issue is what to do if you see a token in an UNLABELED message that you've never seen before. If all you're doing is using Pr[x_j|C_i] = (# x_j) / (total # tokens), then you have that Pr[x_j|C_i] = 0 if you've never seen token x_j before in your TRAINING data. This is bad. (Question: why is this bad? Hint: consider what happens to Equation 5 if one or more terms are 0.) So instead, we'll use an approximation to Pr[x_j|C_i] that avoids this danger:

    Pr[x_j|C_i] ≈ ((# x_j tokens in class C_i) + 1) / ((total # of tokens in class C_i) + 1)

This is called a Laplace correction or, equivalently, Laplace smoothing. (It also happens to be a special case of a Dirichlet prior, but we won't go into that here.)

Now you have enough mathematical background and tricks to implement the SPAM filter. The rest is Java...
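To make the counting concrete, here is a small taste of that Java: a sketch of the bookkeeping of Figure 1 and the Laplace-corrected estimate above, for a single class. It is illustrative only; a HashMap stands in so the file compiles on its own (the real suite MUST use MondoHashTable, per Section 4), the class and method names are the sketch's own, and tokenization, durable storage, and the classification loop of Figure 2 are deliberately left out.

import java.util.HashMap;
import java.util.Map;

/*
 * Illustrative sketch only: per-class token counting (Figure 1) and the
 * Laplace-corrected estimate of Pr[x_j | C_i].  One instance would be kept
 * for SPAM and one for NORMAL; messageCount feeds the PRIOR Pr[C_i].
 */
public class ClassModelSketch {

    private final Map tokenCounts = new HashMap();  // token -> Integer count
    private long totalTokens = 0;                   // all tokens seen for this class
    private long messageCount = 0;                  // emails seen for this class

    /** Figure 1: fold one tokenized TRAINING message of this class into the counts. */
    public void train(String[] tokens) {
        for (int j = 0; j < tokens.length; j++) {
            Integer old = (Integer) tokenCounts.get(tokens[j]);
            int count = (old == null) ? 1 : old.intValue() + 1;
            tokenCounts.put(tokens[j], new Integer(count));  // step 2(a)
            totalTokens++;                                   // step 2(b)
        }
        messageCount++;                                      // step 3
    }

    /** Laplace-corrected estimate of Pr[token | this class]; never zero. */
    public double tokenProbability(String token) {
        Integer c = (Integer) tokenCounts.get(token);
        double seen = (c == null) ? 0.0 : c.doubleValue();
        return (seen + 1.0) / (totalTokens + 1.0);
    }

    public long getMessageCount() { return messageCount; }

    // Tiny demonstration with two toy SPAM messages.
    public static void main(String[] args) {
        ClassModelSketch spamModel = new ClassModelSketch();
        spamModel.train(new String[] { "cheap", "pills", "cheap" });
        spamModel.train(new String[] { "tyromancy", "pills" });
        System.out.println(spamModel.tokenProbability("cheap"));   // (2+1)/(5+1) = 0.5
        System.out.println(spamModel.tokenProbability("zebra"));   // (0+1)/(5+1) = 0.1666...
    }
}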