Intelligent Systems for Security Informatics

Ebook, 401 pages


About this ebook

The Intelligent Systems Series comprises titles that present state-of-the-art knowledge and the latest advances in intelligent systems. Its scope includes theoretical studies, design methods, and real-world implementations and applications.

The most prevalent topics in Intelligence and Security Informatics (ISI) include data management, data and text mining for ISI applications, terrorism informatics, deception and intent detection, terrorist and criminal social network analysis, public health and bio-security, crime analysis, cyber-infrastructure protection, transportation infrastructure security, policy studies and evaluation, and information assurance, among others. This book covers the most active research work in recent years.

  • Pulls together key information on ensuring national security around the world
  • Concisely presents the latest research on this subject, with several figures to support the text
  • Will be of interest to attendees of the Intelligence and Security Informatics conference series, which includes the IEEE International Conference on Intelligence and Security Informatics (IEEE ISI)
Language: English
Release date: January 28, 2013
ISBN: 9780124059023


    Intelligent Systems for Security Informatics - Christopher C Yang

    Preface

    The Intelligence and Security Informatics conference series, which includes the IEEE International Conference on Intelligence and Security Informatics (IEEE ISI), the European Intelligence and Security Informatics Conference (EISIC), and the Pacific Asia Workshop on Intelligence and Security Informatics (PAISI), started about a decade ago. Since then, it has brought together many academic researchers, law enforcement and intelligence experts, and information technology consultants and experts to discuss their research and practices. The topics in ISI include data management, data and text mining for ISI applications, terrorism informatics, deception and intent detection, terrorist and criminal social network analysis, public health and bio-security, crime analysis, cyber-infrastructure protection, transportation infrastructure security, policy studies and evaluation, and information assurance, among others. In this book, we have covered the most active research work in recent years.

    The intended readership of this book includes (i) public and private sector practitioners in the national/international and homeland security area, (ii) consultants and contractors engaged in ongoing relationships with federal, state, local, and international agencies on projects related to national security, (iii) graduate-level students in Information Sciences, Public Policy, Computer Science, Information Assurance, and Terrorism, and (iv) researchers engaged in security informatics, homeland security, information policy, knowledge management, public administration, and counter-terrorism.

    We hope that readers will find the book valuable and useful in their study or work. We also hope that the book will contribute to the ISI community. Researchers and practitioners in this community will continue to grow and share research findings to contribute to national safety around the world.

    Christopher C. Yang, Drexel University

    Wenji Mao, Chinese Academy of Sciences

    Xiaolong Zheng, Chinese Academy of Sciences

    Hui Wang, National University of Defense Technology

    Chapter 1

    Revealing the Hidden World of the Dark Web

    Social Media Forums and Videos¹

    Hsinchun Chen∗, Dorothy Denning†, Nancy Roberts†, Catherine A. Larson∗, Ximing Yu∗ and Chun-Neng Huang∗, ∗Management Information Systems Department, The University of Arizona, Tucson, Arizona, USA, †Department of Defense Analysis, Naval Postgraduate School, Monterey, California, USA

    Chapter Outline

    1.1 Introduction

    1.2 The Dark Web Forum Portal

    1.2.1 Data Identification and Collection

    1.2.2 Evolution of the Dark Web Forum Portal

    Version 1.0

    Version 2.0

    Version 2.5

    1.2.3 Summary of the Three Versions

    1.2.4 Case Studies using the Dark Web Forum Portal

    Case study I. Dark Forums in Eastern Afghanistan: How to influence the Haqqani audience

    Case study II. Psychological operations

    Conclusion

    1.3 The Video Portal

    1.3.1 System Design

    1.3.2 Data Acquisition

    1.3.3 Data Preparation

    1.3.4 Portal System

    Access control

    Browsing

    Searching

    Post-search filtering

    Multilingual translation

    Social network analysis

    1.4 Conclusion and Future Directions

    Acknowledgments

    References

    1.1 Introduction

    The Internet presence of terrorists, hate groups, and other extremists continues to be of significant interest to counter-terrorism investigators, intelligence analysts, and other researchers in government, industry, and academia, in fields as diverse as psychology, sociology, criminology, and political science; computational and information sciences; and law enforcement, homeland security, and international policy. Through analysis of primary sources such as terrorists’ own websites, videos, chat sites, and Internet forums, researchers attempt to identify who the terrorists and extremists are, how they are using the Internet and to what ends, and who their intended audience is [1]. For example, the United Nations’ Counter-Terrorism Implementation Task Force issued a report in 2009 describing member states’ concerns about continued terrorist use of the Internet for fundraising, recruitment, and cyber attacks, among other things, and analyzed steps to address this use [2]. McNamee et al. [3] examined the message themes found in hate group websites to understand how these groups recruited and reacted to threats through the formation of group identity. Post [4] noted how terrorists had created a virtual community of hatred and wrote of the need to develop a psychology-based counter-terrorism program to, in part, inhibit potential participants from joining, reduce support for these groups, and undermine their activities.

    In 2002, partly in response to burgeoning interest in terrorist use of the Internet, particularly in the aftermath of 9/11, and partly as a natural expansion of its previous work in border security, and information sharing and data mining for law enforcement, the Artificial Intelligence (AI) Lab of the University of Arizona founded its Dark Web project. Dark Web, as it has become known, is a long-term scientific research program that aims to study international terrorism via a computational, data-centric approach (http://ai.arizona.edu/research/).

    Dark Web focuses on the hidden, dark side of the Internet, where terrorists and extremists use the Web to disseminate their ideologies, recruit new members, and even share terrorism training materials. Project goals are twofold: (1) to collect, as comprehensively as possible, all relevant web content generated by international extremist and terrorist groups, including websites, forums, chat rooms, blogs, social networking sites, videos, virtual worlds, etc.; and (2) to develop algorithms, tools, and visualization techniques that will enhance researchers’ and investigators’ abilities to analyze these sites and their relevance, and that are generalizable to and useful across a wide range of domains.

    The next section provides an overview of the genesis and evolution of the Dark Web Forum Portal and includes an examination of the data sources and collection. The following section provides an overview of video portal development. The chapter ends with a conclusion and directions for future work.

    1.2 The Dark Web Forum Portal

    The Dark Web project has for several years collected a wide variety of data related to and emanating from extremist and terrorist groups. These data have included websites, multimedia material linked to the websites, forums, blogs, virtual world implementations, etc. Forums, as dynamic, interactive discussion sites that support online conversations, have proven to be of significant interest. Through the anonymity of posting under screen names, they allow for and support free expression. They are an especially rich source of information for studying organizations and individuals, the evolution of ideas and trends, and other social phenomena. In forums, ongoing conversations are captured in threads, with each thread roughly corresponding to a subject area or topic. The replies, called postings or messages, are generally time-stamped and attributable to a particular poster (author). Analysis of the threads and messages can often reveal dynamic trends in topics and discussions, the sequencing of ideas, and relationships between posters.
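    The thread/message structure described above can be modeled minimally as follows. The class and field names here are hypothetical, chosen only to illustrate how time-stamped, attributable postings hang off a thread; they do not come from the portal's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    poster: str          # screen name of the author (the "poster")
    timestamp: datetime  # when the message was posted
    body: str

@dataclass
class Thread:
    title: str                      # roughly corresponds to a subject area or topic
    messages: list = field(default_factory=list)

    def posters(self):
        """Distinct posters, in order of first appearance (by timestamp)."""
        seen = []
        for m in sorted(self.messages, key=lambda m: m.timestamp):
            if m.poster not in seen:
                seen.append(m.poster)
        return seen
```

Sorting by timestamp is what preserves the reply sequence the chapter emphasizes: the order in which messages are posted determines which conversations and relationships can be reconstructed later.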

    1.2.1 Data Identification and Collection

    The forum sites collected for the Dark Web project were identified with input from terrorism researchers, security and military educators, and other experts. They were selected in part because each is generally dedicated to topics relating to Islamic ideology and theology, and range from moderate to extremist in their opinions and ideologies.

    Once the forums are identified, semi-automated collection programs known as spiders crawl them and capture all messages, including metadata such as author (also known as poster), date, and time. The date and time stamps are especially important for maintaining the reply network: the order in which messages are posted and replied to. The spiders are described in more detail below.

    The forums were originally collected to serve as a research testbed for use in the Lab, particularly to support work in sentiment and affect analysis, and the study of radicalization processes over time.

    Access to these forums is now provided to researchers and others through the Dark Web Forum Portal [5]. The portal contains approximately 15,000,000 messages in five languages: Arabic, English, French, German, and Russian. The English- and Arabic-language forums selected include major jihadist websites; some of the Arabic forums have English-language sections. Three French forums, and the single forums in German and Russian, provide representative content for extremist groups producing content in these languages. Collectively, the forums have approximately 350,000 members/authors. The portal also provides statistical analysis, download, translation, and social network visualization functions for each selected forum.

    Incremental spidering keeps the content up to date [6]. Tools developed for searching, browsing, translation, analysis, and visualization are described in a later section.

    1.2.2 Evolution of the Dark Web Forum Portal

    Version 1.0

    This section covers the development of the portal and includes references to previous work where certain aspects of the portal research and development are explained in more detail.

    As mentioned above, the Dark Web forums were originally collected to serve as a research testbed for the Artificial Intelligence Lab to develop techniques for analyzing the Internet presence and content of hate and extremist groups (e.g. Refs [7–11]). At the time, little previous research had been done on Dark Web forum data integration and searching. Dark Web forums are heterogeneous, widely distributed, numerous, difficult to access, and can mysteriously appear and disappear with no notice or warning. The growing amount of forum material makes searching increasingly difficult [12]. For researchers interested in analyzing or monitoring Dark Web content, data integration and retrieval are critical issues [10]. Without a centralized system, it is labor-intensive, time-consuming, and expensive to search and analyze Dark Web forum data.

    Two other characteristics of Dark Web forums create barriers to use. The first is the dynamic nature of the forums, which creates difficulties for analyzing and visualizing interactions between participants. Visualization can reveal hitherto hidden relationships and networks behind online activity [13]. Social Network Analysis (SNA) is a graph-based method that can be used to analyze the network structure of a group or population [14]. SNA has been used to study various real-world networks [15]. Web forums are ideal platforms for social network research because by default they record participants’ communications and the postings are retrievable [16]. However, few prior studies had actually incorporated an SNA function into a real-time system.
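    As a toy illustration of the kind of reply network SNA operates on, the sketch below assumes (a simplifying approximation when explicit reply links are missing, not the portal's actual method) that each posting replies to the one immediately preceding it in the thread, and computes degree centrality from the resulting graph:

```python
from collections import defaultdict

def reply_network(messages):
    """Build an undirected reply network from (poster, timestamp) pairs.

    Simplifying assumption: each post replies to the immediately
    preceding post in timestamp order. Returns edge -> weight, where
    an edge is a frozenset of two posters.
    """
    msgs = sorted(messages, key=lambda m: m[1])
    edges = defaultdict(int)
    for prev, cur in zip(msgs, msgs[1:]):
        a, b = prev[0], cur[0]
        if a != b:                       # ignore self-replies
            edges[frozenset((a, b))] += 1
    return edges

def degree(edges):
    """Degree centrality: number of distinct neighbors per poster."""
    neighbors = defaultdict(set)
    for pair in edges:
        a, b = tuple(pair)
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {p: len(n) for p, n in neighbors.items()}
```

A real system (such as one built on JUNG) would render this graph interactively; the point here is only that the network is derivable directly from the recorded, retrievable postings.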

    A second characteristic is the multilingual nature of the forums. Forums can be found in many of the world’s languages, and forums collected for Dark Web study were in Arabic and English, initially, with French, German, and Russian forums being added later. It was thus critical that the language barrier be addressed.

    Based on the research gaps discussed above, it was clear that a systematic and integrated approach to collecting, searching, browsing, and analyzing Dark Web forum data was needed. We developed these research questions to guide the next steps [5]:

    • Q1: How can we develop a Web portal for Dark Web forums which will effectively integrate data from multiple forum data sources?

    • Q2: How can we develop efficient, accurate search and browse methods across multiple forum data sources in our portal?

    • Q3: How can we incorporate real-time translation functionality into our portal to enable automatic forum data translation from non-English (e.g. Arabic) to English?

    • Q4: How can we incorporate real-time, user-interactive social network analysis into our portal to analyze and visualize the interactions among forum participants?

    The first iteration of the portal was developed based on the system design shown in Figure 1.1.

    Figure 1.1 Early system design of the Dark Web Forum Portal.

    The early system design contained three modules:

    • Data acquisition – Using spidering programs, web pages from the selected online forums were collected. In the first iteration of the portal, we included six Arabic forums and one English-language forum with a total of about 2.3M messages.

    • Data preparation – Using parsing programs, the detailed forum data and metadata were extracted from the raw HTML web pages and stored locally in a database.

    • System functionality – Using Apache Tomcat for the portal and Microsoft SQL Server 2000 for the database, functions including searching and browsing could be supported. Forums could be searched individually or collectively. For forum statistics analysis, Java applet-based charts were created to show the trends based on the numbers of messages produced over time. The multilingual translation function was implemented using the then-current Google Translation API (http://code.google.com/apis/ajaxlanguage/documentation/#Translation). The social network visualization function provided dynamic, user-interactive networks implemented using JUNG (http://jung.sourceforge.net/) to visualize the interactions among forum members.
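    The trend statistics described above amount to bucketing messages by time period. A minimal sketch of that aggregation (illustrative only; the portal's charts were Java applets backed by SQL Server, not this code):

```python
from collections import Counter
from datetime import datetime

def monthly_trend(timestamps):
    """Count messages per (year, month) bucket, sorted chronologically,
    i.e. the series behind a messages-over-time trend chart."""
    counts = Counter((t.year, t.month) for t in timestamps)
    return sorted(counts.items())
```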

    Figure 1.2 shows a results screen from a single-forum search using the term bomb in the forum Alokab. Alokab is in Arabic; the search term bomb was used to retrieve matching threads (shown in the middle column, labeled Thread Title), and the translation function was then invoked to translate on the fly from Arabic to English (Thread Title Translation).

    Figure 1.2 Screenshot of single-forum search result.

    An evaluation was conducted with a small group of users, each of whom performed all tasks related to each function. All search tasks were completed successfully on both our portal and a benchmark system; however, searching was faster on our system. Users also reacted positively to the translation and SNA functions when queried, using a seven-point Likert scale, about their overall satisfaction with the portal, including its usefulness and ease of use.

    This first iteration of the portal was created to address the challenges involved in integrating data from multiple forum data sources in multiple languages, developing search and browse methods effective for use across multiple data sources, and incorporating into a portal real-time translation and real-time social network analysis functions that are typically stand-alone. More details about the first version of the system and the user evaluation can be found in Zhang et al. [5].

    Version 2.0

    Version 2.0 was developed with several goals in mind:

    • Increase the scope of data collection while minimizing the amount of human effort or intervention needed.

    • Improve the currency of the data presented in the portal and develop the means to keep it updated in as automated a fashion as possible.

    • Enhance searching and browsing from a user perspective.

    To increase the scope of data collection and keep the collection up to date, we needed to examine our spidering procedures. Spiders [17] are defined as software programs that traverse the World Wide Web information space by following hypertext links and retrieving web documents via the standard HTTP protocol. As explained in our previous research, there are six important characteristics of spidering programs: accessibility, collection type, content richness, URL-ordering features, URL-ordering techniques, and collection update procedure [18]. A functional spider program must handle the registration requirements of targeted forums (accessibility), extract the desired information from various data types (collection type), filter out irrelevant file types (content richness), sort queued URLs based on given heuristics (URL-ordering features and techniques), and keep the collection up to date (collection update procedure). An incremental spidering process was added to the data acquisition module of the system [6].
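    The URL-ordering and collection-update characteristics can be sketched as follows. This is a hypothetical illustration, not the Lab's spider: `score` stands in for whatever ordering heuristic is applied, and a per-URL version token stands in for the change detection that drives incremental updates.

```python
import heapq

def incremental_crawl_order(candidate_urls, score, last_crawled):
    """Order queued URLs by a heuristic score, skipping unchanged pages.

    candidate_urls: iterable of (url, version_token) pairs.
    score: heuristic function url -> number; higher is crawled first.
    last_crawled: dict url -> version token seen on the previous crawl;
                  a URL is re-queued only if its token has changed.
    """
    queue = []
    for url, token in candidate_urls:
        if last_crawled.get(url) == token:
            continue  # unchanged since the previous crawl: skip (incremental)
        heapq.heappush(queue, (-score(url), url))  # max-heap via negated score
    return [heapq.heappop(queue)[1] for _ in range(len(queue))]
```

Skipping already-collected, unchanged pages is what lets an incremental spider cover many forums quickly instead of re-fetching entire sites.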

    The addition of the incremental spidering component allowed the portal to stay up to date within 2 weeks of forum postings. It also enabled us to acquire many more forums, increasing the collection from seven forums with 2.3M messages in the first version to 29 forums and more than 13M messages in the second. Tests performed during the development of version 2.0 showed, for example, that the incremental spider allowed us to collect 29,000 messages in less than 45 minutes [6].

    Another goal, as listed above, was to improve the searching and browsing experience of users. More flexible Boolean searching was added, allowing users to perform AND and OR searches. Users could also now enter their search terms in English (or any language) and retrieve matches regardless of the original language of the forum content. The display was improved to allow users to comprehend, at a glance, how to view, translate, or download results, whether threads or messages.
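    A minimal sketch of AND/OR matching of the sort added in version 2.0 (illustrative only; this is a toy whole-token matcher, not the portal's implementation):

```python
def boolean_search(docs, terms, mode="AND"):
    """Return ids of documents matching all (AND) or any (OR) terms.

    docs: dict mapping doc id -> text.
    Matching is case-insensitive on whitespace-separated tokens.
    """
    terms = [t.lower() for t in terms]
    results = []
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        hits = [t in tokens for t in terms]
        if (mode == "AND" and all(hits)) or (mode == "OR" and any(hits)):
            results.append(doc_id)
    return results
```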

    Version 2.5

    While version 2.0 addressed many of the issues we identified in usability tests, improvements in searching were still needed. Search is one of the most important and most heavily used functions in the portal, and, as of version 2.0, search results remained unsatisfactory in the following respects:

    • Query parsing: While version 2.0 added some Boolean searching capability, it did not support complex, sophisticated queries.

    • Search ranking: The search ranking was problematic when multiple keywords with the OR relationship were entered by users.

    • Hit highlighting: Matched keywords were not always correctly highlighted; some highlighted words did not match the input search terms.

    • Searching efficiency: Searching for messages in more than five forums simultaneously was very slow from a user perspective.

    Given these issues, we embarked on a newer version of the portal, version 2.5, based on version 2.0. We adopted Lucene, a popular Java-based full-text indexing framework, for the indexing and searching of thread titles and message contents (http://lucene.apache.org/).

    Features of Lucene include high-performance indexing that scales well, and accurate and efficient search algorithms. Its index size is roughly 20–30% of the size of the text indexed, and Lucene Java is compatible with Lucene implemented in other programming languages. Incremental and batch indexing are both fast. It offers ranked searching in which the best results are returned first, and also offers a wide range of query types.
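    To make the indexing idea concrete, here is a toy inverted index with TF-IDF-style ranked retrieval. It illustrates what a full-text index does and why ranked search returns the best results first; it is not Lucene's actual data structure or scoring function:

```python
import math
from collections import Counter, defaultdict

class InvertedIndex:
    """Toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.ndocs = 0

    def add(self, doc_id, text):
        """Index one document (supports incremental, one-at-a-time adds)."""
        self.ndocs += 1
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return doc ids ranked by a simple TF-IDF-style score, best first."""
        scores = defaultdict(float)
        for term in query.lower().split():
            plist = self.postings.get(term, {})
            if not plist:
                continue
            idf = math.log(1 + self.ndocs / len(plist))  # rarer term = higher weight
            for doc_id, tf in plist.items():
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)
```

Because only terms that occur are stored, the index stays a fraction of the size of the indexed text, which is the same space property claimed for Lucene above (though Lucene achieves far better compression and performance).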

    Implementing Lucene for multilingual searching required a preliminary analysis of the languages involved. The Dark Web Forum Portal (DWFP) contains 29 forums in five languages: English, Arabic, French, German, and Russian. We examined the languages contained in the 29 forums manually and found that among the 17 Arabic forums, 16 are purely in Arabic. An exception was the forum Alqimmah, which contains a considerable number of English messages. All seven English-language forums contain Arabic messages. All French, German, and Russian forums also contain Arabic messages. See Table 1.1 for a listing of
