Text mining has become an important research area since the explosion of digital and online text, where data may be stored as electronic documents or as text fields in databases, and it has grown in both importance and economic value. One of the main goals of text mining is the automatic classification of electronic documents: computer programs scan the text of a document and apply a model that assigns the document to one or more predefined classes (topics). In the first step, each document is preprocessed with natural language processing methods that remove data conveying no useful information and then reduce ambiguity by mapping words with the same meaning to a single word, which makes comparison between documents easier. The second step used in this thesis is document classification. In this step the set of documents is randomly divided into training and testing subsets, both of which are used for machine learning. A vector space model based on the Term Frequency Inverse Document Frequency (TFIDF) method, together with a similarity factor, is used to classify both the training and testing documents. Finally, the performance of the classification method is measured. The Reuters-21578 test collection was used to measure the performance of the proposed classification method. The results obtained are encouraging.
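The pipeline described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the stop-word list, the toy documents, and the function names are hypothetical; the weighting is the common tf × log(N/df) form; the "similarity factor" is taken to be cosine similarity; and a document receives the label of its most similar training document. The thesis itself may differ on any of these points.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for"}

def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def idf_table(token_lists):
    """Inverse document frequency log(N / df) for every training term."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in training are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: freq * idf[term] for term, freq in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed similarity factor)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify(text, train_vectors, train_labels, idf):
    """Return the label of the most similar training document."""
    query = tfidf_vector(tokenize(text), idf)
    scores = [cosine(query, vec) for vec in train_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]

def accuracy(test_set, train_vectors, train_labels, idf):
    """Fraction of test documents assigned their true label."""
    hits = sum(classify(text, train_vectors, train_labels, idf) == label
               for text, label in test_set)
    return hits / len(test_set)

# Toy training and testing documents (invented, not taken from Reuters-21578).
train = [
    ("wheat harvest prices rise in the grain market", "grain"),
    ("corn and wheat exports fall", "grain"),
    ("crude oil prices climb as opec cuts output", "crude"),
    ("oil refinery output drops", "crude"),
]
test = [
    ("wheat exports increase", "grain"),
    ("opec oil output", "crude"),
]

train_tokens = [tokenize(text) for text, _ in train]
train_labels = [label for _, label in train]
idf = idf_table(train_tokens)
train_vectors = [tfidf_vector(tokens, idf) for tokens in train_tokens]

if __name__ == "__main__":
    print(accuracy(test, train_vectors, train_labels, idf))
```

The sketch omits the synonym-merging step and the random train/test split mentioned above; on a real collection, these would come before the vectors are built.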
Content Based Text Mining
number:
2126
English
College:
department:
Degree:
Supervisor:
Dr. Taha S. Bashaga
year:
2009