Python 3 Text Processing with NLTK 3 Cookbook
Ebook · 866 pages · 6 hours


About this ebook

This book is intended for Python programmers interested in learning how to do natural language processing. Maybe you've learned the limits of regular expressions the hard way, or you've realized that human language cannot be deterministically parsed like a computer language. Perhaps you have more text than you know what to do with, and need automated ways to analyze and structure that text. This Cookbook will show you how to train and use statistical language models to process text in ways that are practically impossible with standard programming tools. Basic knowledge of Python and of fundamental text processing concepts is expected. Some experience with regular expressions will also be helpful.
Language: English
Release date: Aug 26, 2014
ISBN: 9781782167860


    Python 3 Text Processing with NLTK 3 Cookbook - Jacob Perkins

    Table of Contents

    Python 3 Text Processing with NLTK 3 Cookbook

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why Subscribe?

    Free Access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Tokenizing Text and WordNet Basics

    Introduction

    Tokenizing text into sentences

    Getting ready

    How to do it...

    How it works...

    There's more...

    Tokenizing sentences in other languages

    See also

    Tokenizing sentences into words

    How to do it...

    How it works...

    There's more...

    Separating contractions

    PunktWordTokenizer

    WordPunctTokenizer

    See also

    Tokenizing sentences using regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    Simple whitespace tokenizer

    See also

    Training a sentence tokenizer

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Filtering stopwords in a tokenized sentence

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Looking up Synsets for a word in WordNet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Working with hypernyms

    Part of speech (POS)

    See also

    Looking up lemmas and synonyms in WordNet

    How to do it...

    How it works...

    There's more...

    All possible synonyms

    Antonyms

    See also

    Calculating WordNet Synset similarity

    How to do it...

    How it works...

    There's more...

    Comparing verbs

    Path and Leacock Chodorow (LCH) similarity

    See also

    Discovering word collocations

    Getting ready

    How to do it...

    How it works...

    There's more...

    Scoring functions

    Scoring ngrams

    See also

    2. Replacing and Correcting Words

    Introduction

    Stemming words

    How to do it...

    How it works...

    There's more...

    The LancasterStemmer class

    The RegexpStemmer class

    The SnowballStemmer class

    See also

    Lemmatizing words with WordNet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Combining stemming with lemmatization

    See also

    Replacing words matching regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    Replacement before tokenization

    See also

    Removing repeating characters

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Spelling correction with Enchant

    Getting ready

    How to do it...

    How it works...

    There's more...

    The en_GB dictionary

    Personal word lists

    See also

    Replacing synonyms

    Getting ready

    How to do it...

    How it works...

    There's more...

    CSV synonym replacement

    YAML synonym replacement

    See also

    Replacing negations with antonyms

    How to do it...

    How it works...

    There's more...

    See also

    3. Creating Custom Corpora

    Introduction

    Setting up a custom corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Loading a YAML file

    See also

    Creating a wordlist corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Names wordlist corpus

    English words corpus

    See also

    Creating a part-of-speech tagged word corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Customizing the word tokenizer

    Customizing the sentence tokenizer

    Customizing the paragraph block reader

    Customizing the tag separator

    Converting tags to a universal tagset

    See also

    Creating a chunked phrase corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Tree leaves

    Treebank chunk corpus

    CoNLL2000 corpus

    See also

    Creating a categorized text corpus

    Getting ready

    How to do it...

    How it works...

    There's more...

    Category file

    Categorized tagged corpus reader

    Categorized corpora

    See also

    Creating a categorized chunk corpus reader

    Getting ready

    How to do it...

    How it works...

    There's more...

    Categorized CoNLL chunk corpus reader

    See also

    Lazy corpus loading

    How to do it...

    How it works...

    There's more...

    Creating a custom corpus view

    How to do it...

    How it works...

    There's more...

    Block reader functions

    Pickle corpus view

    Concatenated corpus view

    See also

    Creating a MongoDB-backed corpus reader

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Corpus editing with file locking

    Getting ready

    How to do it...

    How it works...

    4. Part-of-speech Tagging

    Introduction

    Default tagging

    Getting ready

    How to do it...

    How it works...

    There's more...

    Evaluating accuracy

    Tagging sentences

    Untagging a tagged sentence

    See also

    Training a unigram part-of-speech tagger

    How to do it...

    How it works...

    There's more...

    Overriding the context model

    Minimum frequency cutoff

    See also

    Combining taggers with backoff tagging

    How to do it...

    How it works...

    There's more...

    Saving and loading a trained tagger with pickle

    See also

    Training and combining ngram taggers

    Getting ready

    How to do it...

    How it works...

    There's more...

    Quadgram tagger

    See also

    Creating a model of likely word tags

    How to do it...

    How it works...

    There's more...

    See also

    Tagging with regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Affix tagging

    How to do it...

    How it works...

    There's more...

    Working with min_stem_length

    See also

    Training a Brill tagger

    How to do it...

    How it works...

    There's more...

    Tracing

    See also

    Training the TnT tagger

    How to do it...

    How it works...

    There's more...

    Controlling the beam search

    Significance of capitalization

    See also

    Using WordNet for tagging

    Getting ready

    How to do it...

    How it works...

    See also

    Tagging proper names

    How to do it...

    How it works...

    See also

    Classifier-based tagging

    How to do it...

    How it works...

    There's more...

    Detecting features with a custom feature detector

    Setting a cutoff probability

    Using a pre-trained classifier

    See also

    Training a tagger with NLTK-Trainer

    How to do it...

    How it works...

    There's more...

    Saving a pickled tagger

    Training on a custom corpus

    Training with universal tags

    Analyzing a tagger against a tagged corpus

    Analyzing a tagged corpus

    See also

    5. Extracting Chunks

    Introduction

    Chunking and chinking with regular expressions

    Getting ready

    How to do it...

    How it works...

    There's more...

    Parsing different chunk types

    Parsing alternative patterns

    Chunk rule with context

    See also

    Merging and splitting chunks with regular expressions

    How to do it...

    How it works...

    There's more...

    Specifying rule descriptions

    See also

    Expanding and removing chunks with regular expressions

    How to do it...

    How it works...

    There's more...

    See also

    Partial parsing with regular expressions

    How to do it...

    How it works...

    There's more...

    The ChunkScore metrics

    Looping and tracing chunk rules

    See also

    Training a tagger-based chunker

    How to do it...

    How it works...

    There's more...

    Using different taggers

    See also

    Classification-based chunking

    How to do it...

    How it works...

    There's more...

    Using a different classifier builder

    See also

    Extracting named entities

    How to do it...

    How it works...

    There's more...

    Binary named entity extraction

    See also

    Extracting proper noun chunks

    How to do it...

    How it works...

    There's more...

    See also

    Extracting location chunks

    How to do it...

    How it works...

    There's more...

    See also

    Training a named entity chunker

    How to do it...

    How it works...

    There's more...

    See also

    Training a chunker with NLTK-Trainer

    How to do it...

    How it works...

    There's more...

    Saving a pickled chunker

    Training a named entity chunker

    Training on a custom corpus

    Training on parse trees

    Analyzing a chunker against a chunked corpus

    Analyzing a chunked corpus

    See also

    6. Transforming Chunks and Trees

    Introduction

    Filtering insignificant words from a sentence

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Correcting verb forms

    Getting ready

    How to do it...

    How it works...

    See also

    Swapping verb phrases

    How to do it...

    How it works...

    There's more...

    See also

    Swapping noun cardinals

    How to do it...

    How it works...

    See also

    Swapping infinitive phrases

    How to do it...

    How it works...

    There's more...

    See also

    Singularizing plural nouns

    How to do it...

    How it works...

    See also

    Chaining chunk transformations

    How to do it...

    How it works...

    There's more...

    See also

    Converting a chunk tree to text

    How to do it...

    How it works...

    There's more...

    See also

    Flattening a deep tree

    Getting ready

    How to do it...

    How it works...

    There's more...

    The cess_esp and cess_cat treebank

    See also

    Creating a shallow tree

    How to do it...

    How it works...

    See also

    Converting tree labels

    Getting ready

    How to do it...

    How it works...

    See also

    7. Text Classification

    Introduction

    Bag of words feature extraction

    How to do it...

    How it works...

    There's more...

    Filtering stopwords

    Including significant bigrams

    See also

    Training a Naive Bayes classifier

    Getting ready

    How to do it...

    How it works...

    There's more...

    Classification probability

    Most informative features

    Training estimator

    Manual training

    See also

    Training a decision tree classifier

    How to do it...

    How it works...

    There's more...

    Controlling uncertainty with entropy_cutoff

    Controlling tree depth with depth_cutoff

    Controlling decisions with support_cutoff

    See also

    Training a maximum entropy classifier

    Getting ready

    How to do it...

    How it works...

    There's more...

    Megam algorithm

    See also

    Training scikit-learn classifiers

    Getting ready

    How to do it...

    How it works...

    There's more...

    Comparing Naive Bayes algorithms

    Training with logistic regression

    Training with LinearSVC

    See also

    Measuring precision and recall of a classifier

    How to do it...

    How it works...

    There's more...

    F-measure

    See also

    Calculating high information words

    How to do it...

    How it works...

    There's more...

    The MaxentClassifier class with high information words

    The DecisionTreeClassifier class with high information words

    The SklearnClassifier class with high information words

    See also

    Combining classifiers with voting

    Getting ready

    How to do it...

    How it works...

    See also

    Classifying with multiple binary classifiers

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Training a classifier with NLTK-Trainer

    How to do it...

    How it works...

    There's more...

    Saving a pickled classifier

    Using different training instances

    The most informative features

    The Maxent and LogisticRegression classifiers

    SVMs

    Combining classifiers

    High information words and bigrams

    Cross-fold validation

    Analyzing a classifier

    See also

    8. Distributed Processing and Handling Large Datasets

    Introduction

    Distributed tagging with execnet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Creating multiple channels

    Local versus remote gateways

    See also

    Distributed chunking with execnet

    Getting ready

    How to do it...

    How it works...

    There's more...

    Python subprocesses

    See also

    Parallel list processing with execnet

    How to do it...

    How it works...

    There's more...

    See also

    Storing a frequency distribution in Redis

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Storing a conditional frequency distribution in Redis

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Storing an ordered dictionary in Redis

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Distributed word scoring with Redis and execnet

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    9. Parsing Specific Data Types

    Introduction

    Parsing dates and times with dateutil

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Timezone lookup and conversion

    Getting ready

    How to do it...

    How it works...

    There's more...

    Local timezone

    Custom offsets

    See also

    Extracting URLs from HTML with lxml

    Getting ready

    How to do it...

    How it works...

    There's more...

    Extracting links directly

    Parsing HTML from URLs or files

    Extracting links with XPaths

    See also

    Cleaning and stripping HTML

    Getting ready

    How to do it...

    How it works...

    There's more...

    See also

    Converting HTML entities with BeautifulSoup

    Getting ready

    How to do it...

    How it works...

    There's more...

    Extracting URLs with BeautifulSoup

    See also

    Detecting and converting character encodings

    Getting ready

    How to do it...

    How it works...

    There's more...

    Converting to ASCII

    UnicodeDammit conversion

    See also

    A. Penn Treebank Part-of-speech Tags

    Index

    Python 3 Text Processing with NLTK 3 Cookbook

    Copyright © 2014 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: November 2010

    Second edition: August 2014

    Production reference: 1200814

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78216-785-3

    www.packtpub.com

    Cover image by Faiz Fattohi (<faizfattohi@gmail.com>)

    Credits

    Author

    Jacob Perkins

    Reviewers

    Patrick Chan

    Mohit Goenka

    Lihang Li

    Maurice HT Ling

    Jing (Dave) Tian

    Commissioning Editor

    Kevin Colaco

    Acquisition Editor

    Kevin Colaco

    Content Development Editor

    Amey Varangaonkar

    Technical Editor

    Humera Shaikh

    Copy Editors

    Deepa Nambiar

    Laxmi Subramanian

    Project Coordinator

    Leena Purkait

    Proofreaders

    Simran Bhogal

    Paul Hindle

    Indexers

    Hemangini Bari

    Mariammal Chettiyar

    Tejal Soni

    Priya Subramani

    Graphics

    Ronak Dhruv

    Disha Haria

    Yuvraj Mannari

    Abhinash Sahu

    Production Coordinators

    Pooja Chiplunkar

    Conidon Miranda

    Nilesh R. Mohite

    Cover Work

    Pooja Chiplunkar

    About the Author

    Jacob Perkins is the cofounder and CTO of Weotta, a local search company. Weotta uses NLP and machine learning to create powerful and easy-to-use natural language search for what to do and where to go.

    He is the author of Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing, and has contributed a chapter to the Bad Data Handbook, O'Reilly Media. He writes about NLTK, Python, and other technology topics at http://streamhacker.com.

    To demonstrate the capabilities of NLTK and natural language processing, he developed http://text-processing.com, which provides simple demos and NLP APIs for commercial use. He has contributed to various open source projects, including NLTK, and created NLTK-Trainer to simplify the process of training NLTK models. For more information, visit https://github.com/japerk/nltk-trainer.

    I would like to thank my friends and family for their part in making this book possible. And thanks to the editors and reviewers at Packt Publishing for their helpful feedback and suggestions. Finally, this book wouldn't be possible without the fantastic NLTK project and team: http://www.nltk.org/.

    About the Reviewers

    Patrick Chan is an avid Python programmer and uses Python extensively for data processing.

    I would like to thank my beautiful wife, Thanh Tuyen, for her endless patience and understanding in putting up with my various late-night hacking sessions.

    Mohit Goenka is a software developer on the Yahoo Mail team. He graduated from the University of Southern California (USC) with a Master's degree in Computer Science; his thesis focused on game theory and human behavior concepts as applied to real-world security games. He also received an award for academic excellence from the Office of International Services at USC. He has worked in many areas of computing, including artificial intelligence, machine learning, path planning, multiagent systems, neural networks, computer vision, computer networks, and operating systems.

    During his time as a student, he won multiple code-cracking competitions and presented his work on Detection of Untouched UFOs to a wide audience. Coding is not only his profession but also his hobby; he spends most of his free time learning about new technology and developing his skills.

    He is also a poet: some of his works are part of the University of Southern California Libraries archive, under the cover of The Lewis Carroll collection. In addition, he has made significant contributions by volunteering his time to serve the community.

    Lihang Li received his BE degree in Mechanical Engineering from Huazhong University of Science and Technology (HUST), China, in 2012, and is now pursuing his MS degree in Computer Vision at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (IACAS).

    As a graduate student, he focuses on computer vision, especially on vision-based SLAM algorithms. In his free time, he takes part in open source activities and is now the President of the Open Source Club, Chinese Academy of Sciences. Building multicopters is his hobby, and he is part of a team called OpenDrone from BLUG (Beijing Linux User Group).

    His interests include Linux, open source, cloud computing, virtualization, computer vision, operating systems, machine learning, data mining, and a variety of programming languages.

    You can find him by visiting his personal website http://hustcalm.me.

    Many thanks to my girlfriend Jingjing Shao, who is always with me. I must also thank the entire team at Packt Publishing; in particular, Kartik, who was a very good Project Coordinator. I would also like to thank the other reviewers; though we haven't met, I'm really happy to have worked with you.

    Maurice HT Ling completed his PhD in Bioinformatics and his BSc (Hons) in Molecular and Cell Biology at The University of Melbourne. He is currently a Research Fellow at Nanyang Technological University, Singapore, and an Honorary Fellow at The University of Melbourne, Australia. He co-edits The Python Papers and co-founded the Python User Group (Singapore), where he has served as an executive committee member since 2010. His research interests lie in life (biological life, artificial life, and artificial intelligence) and in using computer science and statistics as tools to understand life and its numerous aspects. His personal website is http://maurice.vodien.com.

    Jing (Dave) Tian is a graduate research fellow and a PhD student in the Computer and Information Science and Engineering (CISE) department at the University of Florida. His research involves system security, embedded system security, trusted computing, and static analysis for security and virtualization. He is interested in Linux kernel hacking and compilers. He also spent a year working on AI and machine learning, and taught classes on Intro to Problem Solving Using Python and Operating Systems in the Computer Science department at the University of Oregon. Before that, he worked as a software developer in the Linux Control Platform (LCP) group at Alcatel-Lucent (formerly Lucent Technologies) R&D for around 4 years. He holds BS and ME degrees in Electrical Engineering from China. His website is http://davejingtian.org.

    I would like to thank the author of the book, who has done a great job with both Python and NLTK. I would also like to thank the editors of the book, who polished this book and offered me the opportunity to review such a nice book.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    You might want to visit www.PacktPub.com for support files and downloads related to your book.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    http://PacktLib.PacktPub.com

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

    Why Subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print and bookmark content

    On demand and accessible via web browser

    Free Access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

    Preface

    Natural language processing is used everywhere, from search engines such as Google or Weotta, to voice interfaces such as Siri or Dragon NaturallySpeaking. Python's Natural Language Toolkit (NLTK) is a suite of libraries that has become one of the best tools for prototyping and building natural language processing systems.

    Python 3 Text Processing with NLTK 3 Cookbook is your handy and illustrative guide, which will walk you through many natural language processing techniques in a step-by-step manner. It will demystify the dark arts of text mining and language processing using the comprehensive Natural Language Toolkit.

    This book cuts the preamble short, skips lengthy pedagogy, and lets you dive right into the techniques of text processing with a practical, hands-on approach.

    Get started by learning how to tokenize text into words and sentences, then explore the WordNet lexical dictionary. Learn the basics of stemming and lemmatization. Discover various ways to replace words and perform spelling corrections. Create your own corpora and custom corpus readers, including a MongoDB-based corpus reader. Use part-of-speech taggers to annotate words. Create and transform chunked phrase trees and named entities using partial parsing and chunk transformations. Dig into feature extraction and text classification for sentiment analysis. Learn how to process large amounts of text with distributed processing and NoSQL databases.

    This book will teach you all that and more, in a hands-on learn-by-doing manner. Become an expert in using NLTK for Natural Language Processing with this useful companion.

    What this book covers

    Chapter 1, Tokenizing Text and WordNet Basics, covers how to tokenize text into sentences and words, then look up those words in the WordNet lexical dictionary.
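    As a small taste of the chapter's topic, here is a minimal sketch of word tokenization with NLTK (an assumption for illustration: it uses `wordpunct_tokenize`, a regular-expression tokenizer that needs no downloaded model, unlike the Punkt sentence tokenizer or WordNet lookups covered in the chapter):

```python
from nltk.tokenize import wordpunct_tokenize

# wordpunct_tokenize splits on whitespace and punctuation,
# so contractions are broken apart at the apostrophe.
tokens = wordpunct_tokenize("Can't is a contraction.")
print(tokens)  # ['Can', "'", 't', 'is', 'a', 'contraction', '.']
```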

    Chapter 2, Replacing and Correcting Words, demonstrates various word replacement and correction techniques, including stemming, lemmatization, and using the Enchant spelling dictionary.
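    For example, stemming strips affixes by rule, so NLTK's stemmers work without any corpus downloads (WordNet lemmatization, also covered in the chapter, requires the wordnet data). A minimal sketch:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Both stemmers reduce 'cooking' to its stem.
print(porter.stem('cooking'))     # 'cook'
print(lancaster.stem('cooking'))  # 'cook'
# Stems are not always dictionary words.
print(porter.stem('cookery'))     # 'cookeri'
```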

    Chapter 3, Creating Custom Corpora, explains how to use corpus readers and create custom corpora. It also covers how to use some of the corpora that come with NLTK.
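    As an illustration of the idea, a corpus reader simply wraps files in a directory. This hypothetical sketch creates a throwaway wordlist file in a temporary directory and points a `WordListCorpusReader` at it (the file name and contents are invented for the example):

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Create a one-file "corpus" in a temporary directory.
root = tempfile.mkdtemp()
with open(os.path.join(root, 'wordlist.txt'), 'w') as f:
    f.write('nltk\ncorpus\ncookbook\n')

# The reader treats each line of the file as one word.
reader = WordListCorpusReader(root, ['wordlist.txt'])
print(reader.words())  # ['nltk', 'corpus', 'cookbook']
```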

    Chapter 4, Part-of-speech Tagging, shows how to annotate a sentence of words with part-of-speech tags, and how to train your own custom part-of-speech tagger.
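    As a minimal sketch of the training idea, here a `UnigramTagger` is trained on a tiny hand-tagged corpus invented for the example (real training would use a tagged corpus such as treebank, which requires a data download), with a `DefaultTagger` as backoff for unseen words:

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-tagged training corpus (invented for illustration).
train_sents = [
    [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')],
    [('the', 'DT'), ('dog', 'NN'), ('ran', 'VBD')],
]

# Words not seen in training fall back to the default 'NN' tag.
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(tagger.tag(['the', 'dog', 'sat']))
# [('the', 'DT'), ('dog', 'NN'), ('sat', 'VBD')]
```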

    Chapter 5, Extracting Chunks, covers the chunking process, also known as partial parsing, which can identify phrases and named entities in a sentence. It also explains how to train your own custom chunker and create specific named entity recognizers.
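    As a small sketch of chunking with a regular-expression grammar (the grammar and the tagged sentence are invented for illustration), `RegexpParser` groups part-of-speech tagged words into phrases:

```python
from nltk.chunk import RegexpParser

# NP chunk rule: an optional determiner, any number of
# adjectives, then a noun.
chunker = RegexpParser('NP: {<DT>?<JJ>*<NN>}')

tree = chunker.parse([('the', 'DT'), ('quick', 'JJ'),
                      ('fox', 'NN'), ('ran', 'VBD')])
print(tree)
```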

    Chapter 6, Transforming Chunks and Trees, demonstrates how to transform chunk phrases and parse trees in various ways.
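    One simple transformation of this kind, sketched here on a hand-built chunk tree (invented for the example), is flattening a tree back to plain text by joining its leaf words:

```python
from nltk.tree import Tree

# A small chunk tree whose leaves are (word, tag) pairs.
tree = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]),
                  ('is', 'VBZ'), ('great', 'JJ')])

# Convert the tree back to text by joining the leaf words.
text = ' '.join(word for word, tag in tree.leaves())
print(text)  # 'the book is great'
```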

    Chapter 7, Text Classification, shows how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis. It also covers multi-label classification and classifier evaluation metrics.
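    A minimal sketch of the classification workflow, using a toy training set of bag-of-words feature dictionaries invented for the example (real sentiment training would use a labeled corpus such as movie_reviews):

```python
from nltk.classify import NaiveBayesClassifier

# Toy labeled instances: feature dicts plus sentiment labels.
train = [
    ({'great': True, 'awful': False}, 'pos'),
    ({'great': True, 'awful': False}, 'pos'),
    ({'great': False, 'awful': True}, 'neg'),
    ({'great': False, 'awful': True}, 'neg'),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify({'great': True, 'awful': False}))  # 'pos'
```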

    Chapter 8, Distributed Processing and Handling Large Datasets, discusses how to use execnet for distributed natural language processing and how to use Redis for storing large datasets.

    Chapter 9, Parsing Specific Data Types, covers various Python modules that are useful for parsing specific kinds of data, such as datetimes and HTML.
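    For instance, a minimal sketch of date parsing (assuming the python-dateutil package is installed); `dateutil.parser` handles many freeform date formats:

```python
from dateutil import parser

# Parse a freeform date string into a datetime object.
dt = parser.parse('Aug 26, 2014')
print(dt.year, dt.month, dt.day)  # 2014 8 26
```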

    Appendix A, Penn Treebank Part-of-speech Tags, shows a table of Treebank part-of-speech tags, which is a useful reference for Chapter 3, Creating Custom Corpora, and Chapter 4, Part-of-speech Tagging.

    What you need for this book

    You will need Python 3 and the listed Python packages. For this book, I used Python 3.3.5. To install the packages, you can use pip (https://pypi.python.org/pypi/pip/). The following is the list of the packages in requirements format with the version number used while writing this
