
Q 1. Define OLTP. Explain the differences between OLTP and a Data Warehouse.

Ans.: OLTP (Online Transaction Processing) refers to systems that handle day-to-day business transactions; examples are eBay, Amazon, and ticket reservation systems. Online transaction processing systems are the backbone of an organization because they update the database constantly. At any given moment, if someone needs an inventory balance, an account balance, or the total current value of a financial portfolio, the OLTP system provides it. The OLTP market is a demanding one, often requiring 24x7 operations. Every business has to deal with some form of transactions, and how a company decides to manage these transactions can be an important factor in its success. As a business grows, its number of transactions usually grows as well, so careful planning must be done to ensure that transaction management does not become too complex. Transaction processing is a tool that can help growing businesses deal with their increasing number of transactions.

Differences between OLTP and Data Warehouse
Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to be recorded as and when it occurs. A Data Warehouse is a secondary database designed to facilitate querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than regular OLTP application databases; in this sense an OLAP system is designed to be read-optimized. Separation from the application database also ensures that the business intelligence solution is scalable, better documented, and better managed. Creation of a Data Warehouse leads to a direct increase in the quality of analysis, as the table structures are simpler (you keep only the needed information in simpler tables), standardized (well-documented table structures), and often de-normalized (to reduce the linkages between tables and the corresponding complexity of queries). Having a well-designed Data Warehouse is the foundation for the successful BI (Business Intelligence) and analytics initiatives that are built upon it. Data Warehouses usually store many months or years of data to support historical analysis, whereas OLTP systems usually store data from only a few weeks or months; an OLTP system keeps historical data only as needed to meet the requirements of the current transaction.

Property            | OLTP                  | Data Warehouse
Nature of data      | 3NF                   | Multidimensional
Indexes             | Few                   | Many
Joins               | Many                  | Some
Duplicate data      | Normalized            | Denormalized
Aggregate data      | Rare                  | Common
Queries             | Mostly predefined     | Mostly ad hoc
Nature of queries   | Mostly complex        | Mostly simple
Updates             | All the time          | Not allowed, only refreshed
Historical data     | Often not available   | Essential
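To make the contrast concrete, here is a minimal sketch assuming Python with its built-in sqlite3 module; the orders table and its columns are hypothetical. The first part shows the short write transactions typical of OLTP, the second the read-only aggregate query typical of a warehouse.

# Illustrative contrast between an OLTP-style transaction and a warehouse-style
# analytical query, using an in-memory SQLite database with a hypothetical table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, order_date TEXT)")

# OLTP: short, frequent write transactions that keep the database current.
with db:  # commits (or rolls back) the transaction automatically
    db.execute("INSERT INTO orders VALUES (1, 'alice', 49.90, '2024-01-05')")
    db.execute("INSERT INTO orders VALUES (2, 'bob',   15.00, '2024-02-11')")

# Warehouse/OLAP style: read-only, aggregate query over historical data.
for month, total in db.execute(
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) "
        "FROM orders GROUP BY month"):
    print(month, total)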

Q 2. What are the major components of DWH Architecture? Ans.: The major components of Data Warehouse architecture are:

Source Data Component
Production Data: This category of data comes from the various operational systems of the enterprise. This is the data from many vertical applications. The challenge is to standardize and transform the disparate data from the various production systems, convert the data, and integrate the pieces into useful data for storage in the Data Warehouse.
Internal Data: In every organization, users keep their private spreadsheets, documents, customer profiles, and sometimes even departmental databases. This is the internal data, parts of which could be useful to the Data Warehouse for analysis.
Archived Data: Operational systems are primarily intended to run the current business. In every operational system, you periodically take the old data and store it in archived files.
External Data: Most executives depend on data from external sources for a high percentage of the information they use. They use statistics relating to their industry produced by external agencies, market share data of competitors, and standard values of financial indicators for their business to check on their performance.

Data Staging Component: Three major functions need to be performed to get the data ready. You have to extract the data, transform the data, and then load the data into the Data Warehouse storage. These three major functions of extraction, transformation, and preparation for loading take place in a staging area (a minimal sketch of this extract-transform-load flow appears after this answer).

Data Storage Component: The data storage for the Data Warehouse is a separate repository. Data Warehouses are read-only data repositories, and many of them also employ multidimensional database management systems.

Information Delivery Component: Warehouse information is delivered through online, intranet, Internet, and e-mail channels. Most commonly, you provide for online queries and reports: users enter their requests online and receive the results online. You may also set up delivery of scheduled reports through e-mail.

Metadata Component: Metadata in a Data Warehouse is similar to the data dictionary or the data catalog in a Database Management System.

Management and Control Component: This component of the Data Warehouse architecture sits on top of all the other components. The management and control component coordinates the services and activities within the Data Warehouse.
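The staging flow described above can be illustrated with a minimal, non-authoritative sketch using Python's standard csv and sqlite3 modules; the column names, sample rows, and the fact_orders table are hypothetical.

# Minimal extract-transform-load (ETL) sketch with hypothetical data.
import csv
import io
import sqlite3

# Extract: raw records as they might arrive from an operational system.
raw = io.StringIO("order_id,country,amount\n101, us ,49.90\n102,US,15.00\n103,de,\n")
rows = list(csv.DictReader(raw))

# Transform: standardize codes, convert types, drop incomplete records.
staged = [(r["order_id"], r["country"].strip().upper(), float(r["amount"]))
          for r in rows if r["amount"]]

# Load: write the cleaned rows into the warehouse storage (in-memory here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id TEXT, country TEXT, amount REAL)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", staged)
warehouse.commit()
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())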

Q 3. What is OLAP? Explain.
Ans.: On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources that are incompatible with each other. Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. OLAP systems are market-oriented and used for data analysis by knowledge workers, including managers, executives, and analysts. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making. An OLAP system typically adopts either a star or snowflake model and a subject-oriented database design. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Advantages of OLAP:
- Faster delivery of applications.
- More efficient operations through reduced time spent on query execution.
- The inherent flexibility of OLAP systems means that users may be self-sufficient in running their own analyses without IT assistance.
- Ability to generate dynamic reports.
- Self-sufficiency of users, resulting in a reduction in backlog traffic.
- Ability to model real-world challenges with business metrics and dimensions.
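The slice-and-dice and summarization ideas above can be sketched in a few lines, assuming pandas is available; the sales figures, regions, and products are made up for illustration.

# Minimal slice-and-dice sketch with pandas (hypothetical sales data).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["TV", "Radio", "TV", "Radio", "TV", "TV"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [1200, 300, 900, 250, 1400, 1100],
})

# Summarize at a coarser level of granularity: total amount by region and quarter.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum")
print(cube)

# "Slice": fix one dimension (quarter == Q1) and analyze the remaining ones.
print(sales[sales["quarter"] == "Q1"]
      .groupby(["region", "product"])["amount"].sum())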

Q 4. What is Data Mining? Explain. Ans.: Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not-so-obvious but useful information from large databases; in other words, the efficient discovery of valuable, non-obvious information from a large collection of data. Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data. It is the automatic discovery of new facts and relationships in data that are like valuable nuggets of business data. It is the process of extracting previously unknown, valid, and actionable information from large databases and then using that information to make crucial business decisions.

Data mining streamlines the transformation of masses of information into meaningful knowledge, which is the bottom line of business intelligence. Typical techniques for data mining include decision trees, neural networks, nearest neighbor, clustering, fuzzy logic, and genetic algorithms. Although data mining is still in its infancy, companies in a wide range of industries, including finance, health care, manufacturing, and transportation, are already using data mining tools and techniques to take advantage of historical data.
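As a concrete illustration of one of these techniques, here is a tiny nearest-neighbor sketch in plain Python; the points and labels are invented purely for illustration. A new record simply takes the label of the most similar historical record.

# Minimal nearest-neighbor sketch (hypothetical customer data).
import math

# Historical records whose outcome ("label") is already known.
history = [((2.0, 4.5), "churn"), ((8.1, 1.2), "stay"),
           ((1.5, 5.0), "churn"), ((7.4, 2.0), "stay")]

def predict(point):
    """Classify a new point by the label of its nearest historical record."""
    nearest = min(history, key=lambda rec: math.dist(point, rec[0]))
    return nearest[1]

print(predict((2.2, 4.0)))   # -> "churn"
print(predict((7.9, 1.5)))   # -> "stay"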

The whole logic of data mining is based on modeling. Modeling is simply the act of building a model from data for situations where the answer is known, and then applying that model to situations where the answer is not known.

Q 5. What are the objectives of using data mining in business? Ans.: The main techniques are Association, Classification, Regression, Clustering, and Neural Networks. The basic premise of association is to find all associations such that the presence of one set of items in a transaction implies the other items. Classification develops profiles of different groups. Sequential pattern mining identifies sequential patterns subject to a user-specified minimum constraint. Clustering segments a database into subsets.

Association
The goal of association techniques is to detect relationships or associations between specific values of categorical variables in large data sets. These powerful exploratory techniques have a wide range of applications in many areas of business practice and research, from the analysis of consumer preferences and human resource management to the history of language. Association rule mining has many applications other than market basket analysis, including applications in marketing, customer segmentation, medicine, electronic commerce, bioinformatics, and finance. Market basket analysis is just one form of association rule mining; in fact, there are many kinds of association rules.

Classification
Classification is a data mining (machine learning) technique used to predict group membership for data instances; classification and regression are two of the more popular predictive techniques. Classification involves finding rules that partition the data into disjoint groups. The input for classification is the training data set, whose class labels are already known. Classification analyzes the training data set, constructs a model based on the class label, and aims to assign a class label to future unlabeled records. Since the class field is known, this type of learning is called supervised learning. A set of classification rules is generated by such a classification process, which can be used to classify future data and to develop a better understanding of each class in the database (a small sketch of this workflow appears below).

Regression
Regression is the oldest and most well-known statistical technique that the data mining community utilizes. When you are ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula, and you have a prediction. The major limitation of this technique is that it works well only with continuous quantitative data.

Clustering
Clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. The algorithm attempts to automatically partition the data space into a set of regions or clusters, to which the examples in the table are assigned, either deterministically or probabilistically. The goal of the process is to identify all sets of similar examples in the data, in some optimal fashion.
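Here is a minimal sketch of the supervised classification workflow described above, assuming scikit-learn is installed; the feature values (say, age and income) and the class labels are made up.

# Minimal supervised classification sketch (hypothetical data).
from sklearn.tree import DecisionTreeClassifier

# Training data: class labels are already known (supervised learning).
X_train = [[25, 30000], [47, 82000], [35, 45000], [52, 110000], [23, 24000]]
y_train = ["low", "high", "low", "high", "low"]   # e.g. spending segment

model = DecisionTreeClassifier()   # build a model from labeled data
model.fit(X_train, y_train)

# Apply the model to future, unlabeled records.
X_new = [[41, 90000], [29, 28000]]
print(model.predict(X_new))        # predicted class labels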

Neural Networks
An Artificial Neural Network (ANN) is an information-processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information-processing system: it is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. Neural networks are made up of many artificial neurons, and an artificial neuron is simply an electronically modeled biological neuron. How many neurons are used depends on the task at hand; it could be as few as three or as many as several thousand. There are different types of neural networks, each of which has different strengths particular to its applications. The abilities of different networks can be related to their structure, dynamics, and learning methods.

Q 6. What is Clustering? Explain in detail. Ans.: Clustering is the method by which like records are grouped together. Usually this is done to give the end user a high-level view of what is going on in the database. Clustering is sometimes used to mean segmentation, which most marketing people will tell you is useful for coming up with a bird's-eye view of the business. Clustering is a form of learning by observation rather than learning by examples. Clustering is the classification of similar objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common characteristics. Clusters can be used to help you understand the business better and can also be exploited to improve future performance through predictive analytics.

Cluster Analysis
Cluster analysis is an exploratory data analysis tool for solving classification problems. Its objective is to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Cluster analysis can be used to:
- Formulate hypotheses concerning the origin of the sample, e.g. in evolution studies.
- Describe a sample in terms of a typology, e.g. for market analysis or administrative purposes.
- Predict the future behavior of population types.
- Optimize functional processes, e.g. business site locations or product design.
- Assist in identification, e.g. in diagnosing diseases.
- Measure the different effects of treatments on classes within the population, e.g. with analysis of variance.

There are four methods of clustering: K-means, hierarchical, agglomerative, and divisive.

K-means
The basic steps of k-means clustering are simple. In the beginning, we determine the number of clusters K and assume the centroids (centers) of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.

The k-means algorithm then performs the three steps below until convergence (iterate until stable):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.
(A small sketch of this loop appears after this answer.)

Hierarchical Clustering
In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings.

Agglomerative Clustering
The K-means approach to clustering starts out with a fixed number of clusters and allocates all records into exactly that number of clusters. Another class of methods works by agglomeration. These methods start out with each data point forming its own cluster and gradually merge them into larger and larger clusters until all points have been gathered together into one big cluster. Toward the beginning of the process, the clusters are very small and very pure: the members of each cluster are few and closely related. Toward the end of the process, the clusters are large and not as well defined. The entire history is preserved, making it possible to choose the level of clustering that works best for a given application.

Divisive Clustering
A cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down or divisive clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster. Top-down clustering is conceptually more complex than bottom-up clustering, since we need a second, flat clustering algorithm as a subroutine. It has the advantage of being more efficient if we do not generate a complete hierarchy all the way down to individual document leaves. For a fixed number of top levels, using an efficient flat algorithm like K-means, top-down algorithms are linear in the number of documents and clusters.
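The following is a minimal sketch of the k-means loop above, assuming NumPy is available; the data points and the choice K=2 are made up for illustration.

# Minimal k-means sketch following the three steps above (hypothetical data).
import numpy as np

def kmeans(points, k, iterations=100):
    # Take the first K objects as the initial centroids.
    centroids = points[:k].copy()
    for _ in range(iterations):
        # Step 2: distance of each object to every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: group each object with its nearest centroid.
        labels = distances.argmin(axis=1)
        # Step 1 (next round): recompute the centroid coordinates.
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # stable -> converged
            break
        centroids = new_centroids
    return centroids, labels

data = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                 [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
centroids, labels = kmeans(data, k=2)
print(centroids)   # approximate cluster centers
print(labels)      # cluster assignment of each point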

Q 7. Define the following terminologies: (a) user, (b) page view. Ans.: A user is defined as a single individual who is accessing files from one or more Web servers through a browser. A page view is the aggregation of all the files that are displayed on the user's screen at any one single point in time. A page view typically consists of several files at any one time: frames, text, graphics, multimedia, etc. It is the "Web page" that the user requests and views.

Q 8. What are the Web content mining problems/challenges? Explain. Ans.: Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents such as images, video, and audio, which are embedded in or linked to Web pages. Web content mining can be differentiated from two points of view: the agent-based approach and the database approach. The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web content mining problems/challenges:

Data/Information extraction: Extraction of structured data from Web pages, such as products and search results, is a difficult task. Extracting such data allows one to provide services on top of it. Two main types of techniques, machine learning and automatic extraction, are used to solve this problem (a small extraction sketch appears after this answer).

Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. Identifying or matching semantically similar data is a very important problem with many practical applications.

Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs, and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.

Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications, but generating them manually is very time consuming. A few existing methods exploit the information redundancy of the Web to build them. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.

Segmenting web pages and detecting noise: In many Web applications, one only wants the main content of a Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting a Web page to extract its main content is an interesting problem.
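To make the extraction challenge concrete, here is a minimal sketch using only Python's standard-library HTML parser; the page snippet and its class names are hypothetical, and real extractors would rely on the machine-learning or automatic-extraction techniques mentioned above.

# Minimal structured-data extraction sketch (hypothetical product page markup).
from html.parser import HTMLParser

PAGE = """
<div class="product"><span class="name">USB cable</span><span class="price">3.99</span></div>
<div class="product"><span class="name">Keyboard</span><span class="price">24.50</span></div>
"""

class ProductParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.field = None     # which field we are currently inside
        self.records = []     # extracted product records

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "product":
            self.records.append({})          # start a new product record
        elif tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field and self.records:
            self.records[-1][self.field] = data.strip()
            self.field = None

parser = ProductParser()
parser.feed(PAGE)
print(parser.records)   # [{'name': 'USB cable', 'price': '3.99'}, ...]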
