
International Journal on New Computer Architectures and Their Applications (IJNCAA) 2(1): 34-51
The Society of Digital Information and Wireless Communications, 2012 (ISSN: 2220-9085)

Session Timeout Thresholds Impact on Quality and Quantity of Extracted Sequence Rules

Martin Drlík and Michal Munk
Constantine the Philosopher University in Nitra, Slovak Republic
Tr. A. Hlinku 1, 949 74, Nitra
mdrlik@ukf.sk, mmunk@ukf.sk

ABSTRACT

The effort of using web usage mining methods in the area of educational data mining is to reveal the knowledge hidden in the log files of the web and database servers of contemporary virtual learning environments. By applying data mining methods to these data, interesting patterns concerning the students' behavior can be identified. These methods help us to find the most effective structure of the e-learning courses, optimize the learning content, recommend the most suitable learning path based on students' behavior, or provide a more personalized learning environment. We prepared six datasets of different quality obtained from logs of the virtual learning environment Moodle and pre-processed in different ways. We used three datasets with identified users' sessions based on a 15, 30 and 60-minute session timeout threshold, and three further datasets with the same thresholds including reconstructed paths among course activities. We tried to assess the impact of different session timeout thresholds, with or without path completion, on the quantity and quality of the sequence rules that contribute to the representation of the students' behavioral patterns in a virtual learning environment. The results show that the session timeout threshold has a significant impact on the quality and quantity of extracted sequence rules. In contrast, the completion of paths has a significant impact on neither the quantity nor the quality of the extracted rules.

KEYWORDS

Educational data mining, session timeout threshold, path completion, sequence rules analysis, time window.

1 INTRODUCTION

In educational contexts, web usage mining is a part of web data mining that can contribute to finding significant educational knowledge. We can describe it as extracting unknown actionable intelligence from interaction with the e-learning environment [1]. Web usage mining has been used for personalizing e-learning, adapting educational hypermedia, discovering potential browsing problems, automatically recognizing learner groups in exploratory learning environments, and predicting student performance [2]. Analyzing the unique types of data that come from educational systems can help us to find the most effective structure of e-learning courses, optimize the learning content, recommend the most suitable learning path based on students' behavior, or provide a more personalized environment.

However, traditional e-learning platforms usually do not directly support any web usage mining methods. It is therefore often difficult for educators to obtain useful feedback on students' learning experiences, or to answer the questions of how learners proceed through the learning material and what they gain in knowledge from the online courses [3]. We note the effort of some authors to design tools that automate typical tasks performed in the pre-processing phase [4], and of authors who prepare step-by-step tutorials [5, 6].

Good quality data are a prerequisite for a well-realized data analysis. If there is junk at the input, there will be junk at the output, regardless of the knowledge extraction method used. This applies even more in the area of web log mining, where the log file requires thorough data preparation. As an example, consider usage analysis, where we aim to find out what our web visitors are interested in. For this purpose we can use:

1. survey sampling: we obtain answers to particular items in a questionnaire, and the visitor of our site knows that he/she is the object of our survey [7],
2. web log mining: we analyse the log file of the web server, which contains information on accesses to the pages of our website, and the visitor does not know that he/she is the object of our survey.

While in the case of survey sampling we can provide good quality data by using a reliable and valid measuring procedure, in the case of web log mining we can provide them through good preparation of the data from the log file. Data pre-processing itself is often the most time-consuming phase of web page analysis [8].

We carried out an experiment to find out to what extent it is necessary to execute data pre-processing tasks in order to gain valid data from the log files obtained from learning management systems. Specifically, we would like to assess the impact of session timeout

threshold and path completion on the quantity and quality of extracted sequence rules that represent the learners' behavioral patterns in a learning management system [9]. We compare six datasets of different quality obtained from logs of the learning management system and pre-processed in different ways. We use three datasets with identified users' sessions based on a 15, 30 and 60-minute session timeout threshold (STT), and three further datasets with the same thresholds including reconstructed paths among course activities.

The rest of the paper is structured as follows. In Section 2, we summarize related work of other authors who deal with data pre-processing issues in connection with educational systems. In particular, we pay attention to authors who were concerned with the problem of finding the most suitable value of STT for session identification. In Section 3, we describe the research methodology and how we prepared the log files in different ways. Section 4 gives the results of the experiment in detail. Finally, in Section 5 we discuss the obtained results and indicate our future work.

2 LITERATURE OVERVIEW

The aim of the pre-processing phase is to convert the raw data into a suitable input for the next-stage mining algorithms [1]. Before applying a data mining algorithm, a number of general data pre-processing tasks can be applied. In this paper we focus only on data cleaning, user identification, session identification and path completion.

Marquardt et al. [4] published a comprehensive paper about the application of web usage mining in the e-learning area with a focus on the pre-processing phase. They did not deal with the session timeout threshold in detail. Romero et al. [5] paid more attention to data pre-processing issues in their survey. They summarized specific issues of web data mining in learning management systems and provided references to other relevant research papers. Moreover, Romero et al. dealt with some specific features of data pre-processing tasks in LMS Moodle in [5, 10], but they removed the problem of user identification and session identification from their discussion.

A user session, which is closely associated with user identification, is defined as a sequence of requests made by a single user over a certain navigation period; a user may have a single session or multiple sessions during this period. Session identification is the process of segmenting the log data of each user into individual access sessions [11]. Romero et al. argued that these tasks are solved by logging into and logging out from the system. We can agree with them in the case of user identification. In the e-learning context, unlike other web-based domains, user identification is a straightforward problem because the learners must log in using their unique ID [1]. An excellent review of user identification was made in [3] and [12].

Assuming the user is identified, the next step is to perform session identification by dividing the click stream of each user into sessions. Many approaches to session identification can be found [13-17]. In order to determine when a session ends and the next one begins, the session timeout threshold (STT) is often used. An STT is a pre-defined period of inactivity that allows web applications to

determine when a new session occurs [18]. Each website is unique and should have its own STT value. The correct session timeout threshold has been discussed by several authors, who experimented with a variety of timeouts to find an optimal value [19-24]. However, no generalized model has been proposed to estimate the STT used to generate sessions [19]. Some authors noted that the number of identified sessions is directly dependent on time. Hence, it is important to select the correct time interval so that the number of sessions is estimated accurately [18].

In this paper, we used a reactive time-oriented heuristic method to define the users' sessions. From our point of view, sessions were identified as delimited series of clicks realized in the defined time period. We prepared three different files (A1, A2, A3) with a 15-minute STT (mentioned for example in [25]), a 30-minute STT [12, 19, 26, 27] and a 60-minute STT [28] to start a new session, with regard to the setting used in the learning management system.

The analysis of the path completion of users' activities is another problem. The reconstruction of activities focuses on the retrograde completion of records on the path the user went through by means of the Back button, since the use of this button is not automatically recorded in the log entries of a web-based educational system. Path completion consists of completing the log with inferred accesses. The site topology, represented by the sitemap, is fundamental for this inference and significantly contributes to the quality of the resulting dataset, and thus to the precision and reliability of the patterns [4]. The sitemap can be obtained using a crawler. For the needs of our analysis, we used the Web Crawling application implemented in the Data Miner used. Having ordered the records according to the IP address, we searched for linkages between consecutive pages. We found and analyzed several approaches mentioned in the literature [12, 17]. Finally, we chose the same approach as in our previous paper [9].

A sequence for the selected IP address can look like this: A → B → C → D → X. Based on the sitemap, the algorithm can find that there is no hyperlink from page D to page X. We therefore assume that the user accessed page X by using the Back button from one of the previous pages. Through backward browsing, we can then find out which of the previous pages contains a reference to page X. In our example, we find that there is no hyperlink to page X from page C, so page C is entered into the sequence, i.e. the sequence becomes A → B → C → D → C → X. Similarly, we find that there is no hyperlink from page B to page X, so B is added to the sequence, i.e. A → B → C → D → C → B → X. Finally, the algorithm finds that page A contains a hyperlink to page X, and after the backward path analysis terminates the sequence looks like this: A → B → C → D → C → B → A → X. This means that the user used the Back button to move from page D to C, from C to B and from B to A [29]. After applying this method we obtained the files (B1, B2, B3) with an identification of sessions based on user ID, IP address, different timeout thresholds and completed paths [9].
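The backward path analysis described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiment: the function name `complete_path`, the dictionary representation of the sitemap and the toy link structure are assumptions made for illustration only. The sketch also simplifies the Back-button behaviour by walking back through all pages already recorded in the visit.

```python
def complete_path(sequence, sitemap):
    """Insert inferred Back-button steps into one visit.

    sequence: list of page ids in the order they were logged.
    sitemap:  dict mapping a page id to the set of pages it links to
              (hypothetical representation of the crawled sitemap).
    """
    completed = []
    for page in sequence:
        if completed and page not in sitemap.get(completed[-1], set()):
            # No direct hyperlink from the last page: walk backwards
            # through the earlier pages until one of them links to `page`.
            back_steps = []
            for previous in reversed(completed[:-1]):
                back_steps.append(previous)
                if page in sitemap.get(previous, set()):
                    completed.extend(back_steps)
                    break
            # If no earlier page links to `page`, nothing is inserted.
        completed.append(page)
    return completed


# Toy sitemap matching the example in the text: only page A links to X.
toy_sitemap = {"A": {"B", "X"}, "B": {"C"}, "C": {"D"}, "D": set()}
completed_visit = complete_path(["A", "B", "C", "D", "X"], toy_sitemap)
print(" -> ".join(completed_visit))
```

For the toy sitemap, the logged visit A, B, C, D, X is extended with the inferred returns to C, B and A before X, as in the worked example above.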

3 USED METHODOLOGY

3.1 Sequence Rules

Sequence rules have been derived from association rules, so the differences between them are not large [30]. A k-sequence is a sequence of length k, i.e. it contains k pages. A frequented k-sequence is a variation of the frequented k-item set, or a combination. Usually, the frequented single-item set is identical to the frequented single sequence. The following example illustrates the differences between the algorithms Apriori and AprioriAll, where Apriori serves for searching for association rules and AprioriAll for searching for sequence rules:

D = {S1 = {U1, <a, b, c>}, S2 = {U2, <a, c>}, S3 = {U1, <b, c, e>}, S4 = {U3, <a, c, d, c, e>}},

where D is a database of transactions with a time label, a, b, c, d, e are web sections and U1, U2, U3 represent users. Each transaction is identified by a user. Let us assume that the minimum support is 30 %. In this case, user U1 actually has two transactions. When searching for sequence rules, we consider this user's sequence to be the concatenation of the web sections in transactions S1 and S3, i.e. a sequence can consist of several transactions, and consecutive accesses to pages are not required. Similarly, the support of a sequence is determined not by the percentage of transactions, but by the percentage of users who own the given sequence. A sequence is large (frequented) if it is contained in at least one sequence identified by the user and meets the condition of minimum support.

The set of frequented sequences with k items is denoted Lk. To find Lk, we use the set of candidates, denoted Ck, which contains sequences with k items. The first step is to order the transactions by user, with a time label for each page visited by that user; the remaining steps are similar to those of the algorithm Apriori. Having ordered the transactions, we obtain the sequences identified by the user, which represent the complete references from a single user:

D = {S1 = {U1, <a, b, c>}, S3 = {U1, <b, c, e>}, S2 = {U2, <a, c>}, S4 = {U3, <a, c, d, c, e>}}.

Similarly to the algorithm Apriori, we start by generating the set of candidates of length 1:

C1 = {<a>, <b>, <c>, <d>, <e>},

from which we define the set

L1 = {<a>, <b>, <c>, <d>, <e>}

of single-item sequences, where each page is referenced by at least one user. Subsequently, we generate the set of candidates C2 from L1 by means of so-called full linking, i.e. we make provisions for the web user who browses through the pages forwards or backwards. This is the reason why the algorithm Apriori is not suitable for web log mining, in contrast to the algorithm AprioriAll, which respects this fact. Out of the set of candidates of length 2:

C2 = {<a, b>, <a, c>, <a, d>, <a, e>, <b, a>, <b, c>, <b, d>, <b, e>, <c, a>, <c, b>, <c, d>, <c, e>, <d, a>, <d, b>, <d, c>, <d, e>, <e, a>, <e, b>, <e, c>, <e, d>},

we define the set

L2 = {<a, b>, <a, c>, <a, d>, <a, e>, <b, c>, <b, e>, <c, b>, <c, d>, <c, e>, <d, c>, <d, e>}

of double-item sequences, where each sequence is contained in at least one sequence identified by the user. We then proceed analogously.

3.2 Steps of Experiment

We aimed at specifying the steps that are necessary for gaining valid data from the log file of a learning management system. Specifically, we focused on the identification of sessions based on time windows of various lengths, on the reconstruction of students' activities, and on the influence of the interaction of these two data preparation steps on the derived rules. We tried to assess the impact of these

Figure 1. Application of data pre-processing technique on log file obtained from VLE.


advanced techniques on the quantity and quality of the extracted rules. These rules contribute to the overall representation of the students' behavior patterns. The experiment was carried out in several steps (Figure 1). The first step was data acquisition, which means defining the observed variables to be recorded in the log file from the point of view of obtaining the necessary data (ID, IP address, URL address, date and time of access, activity, etc.). The second step led to the creation of data matrices. The data come from the log file (information on accesses) and from sitemaps (information on the course contents). The aim of the next step was data preparation on various levels. We identified sessions based on a 15-minute STT (File A1), a 30-minute STT (File A2) and a 60-minute STT (File A3). Subsequently, we identified sessions based on a 15-minute STT with completion of the paths (File B1), a 30-minute STT with completion of the paths (File B2) and a 60-minute STT with completion of the paths (File B3). The next step was data analysis, in which we searched for behaviour patterns of students in the individual files.
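The time-based session identification used in the data preparation step can be sketched as follows. This is a minimal illustration, assuming that a new session starts whenever the user ID or IP address changes or when the gap between two consecutive requests exceeds the STT. The function name `identify_sessions`, the record layout and the sample log are hypothetical and are not taken from the experiment.

```python
from datetime import datetime, timedelta


def identify_sessions(records, stt_minutes):
    """Label each (user_id, ip, timestamp) record with a session number.

    A new session starts when the user/IP pair changes or when the
    inactivity gap since the previous request exceeds the STT.
    """
    threshold = timedelta(minutes=stt_minutes)
    ordered = sorted(records, key=lambda r: (r[0], r[1], r[2]))
    labelled = []
    session_id = 0
    previous = None
    for user_id, ip, ts in ordered:
        new_visitor = previous is None or (user_id, ip) != previous[:2]
        if new_visitor or ts - previous[2] > threshold:
            session_id += 1
        labelled.append((user_id, ip, ts, session_id))
        previous = (user_id, ip, ts)
    return labelled


def at(hhmm):
    """Helper for the hypothetical sample log: time of day on one date."""
    return datetime.strptime("2010-11-02 " + hhmm, "%Y-%m-%d %H:%M")


# Hypothetical log: gaps of 10, 20, 50 and 70 minutes for user 1.
log = [
    (1, "10.0.0.1", at("10:00")),
    (1, "10.0.0.1", at("10:10")),
    (1, "10.0.0.1", at("10:30")),
    (1, "10.0.0.1", at("11:20")),
    (1, "10.0.0.1", at("12:30")),
    (2, "10.0.0.2", at("10:05")),
]

session_counts = {
    stt: identify_sessions(log, stt)[-1][3] for stt in (15, 30, 60)
}
print(session_counts)
```

On this sample log the number of identified sessions falls as the STT grows, which is the same direction of change as the decreasing counts of customer's sequences reported in Table 1.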

We used STATISTICA Sequence, Association and Link Analysis for sequence rule extraction. It is an implementation using the powerful Apriori algorithm [31-34] together with a tree-structured procedure that requires only one pass through the data [35]. After the data analysis, we tried to understand the output data: we created data matrices from the outcomes of the analysis and defined assumptions. The last step consists of comparing the results of the data analysis performed on the various levels of data preparation from the point of view of the quantity and quality of the found rules, i.e. the patterns of students' behaviour when browsing the course. We compared the portion of the rules found in the examined files, the portion of inexplicable rules in the examined files, and the values of support and confidence of the rules found in the examined files.

Contemporary learning management systems store information about their users not in a server log file but mainly in a relational database. Very extensive log data of the students' activities can be found there. Learning management systems usually have built-in student monitoring features, so they can record any student activity [36].
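To make the user-based support from Section 3.1 concrete, the following Python sketch recomputes L2 for the example database D. It is only an illustration of the counting idea, not the STATISTICA procedure and not a full AprioriAll implementation; the function names are hypothetical, and the transactions are assumed to be listed in time order.

```python
from itertools import permutations

# Example database D from Section 3.1: (transaction, user, visited sections).
transactions = [
    ("S1", "U1", ["a", "b", "c"]),
    ("S2", "U2", ["a", "c"]),
    ("S3", "U1", ["b", "c", "e"]),
    ("S4", "U3", ["a", "c", "d", "c", "e"]),
]


def user_sequences(transactions):
    """Concatenate each user's transactions (assumed to be in time order)."""
    sequences = {}
    for _, user, pages in transactions:
        sequences.setdefault(user, []).extend(pages)
    return sequences


def contains(sequence, candidate):
    """True if candidate occurs in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    return all(page in remaining for page in candidate)


def support(candidate, sequences):
    """Share of users whose sequence contains the candidate."""
    hits = sum(contains(seq, candidate) for seq in sequences.values())
    return hits / len(sequences)


sequences = user_sequences(transactions)
L1 = sorted({page for seq in sequences.values() for page in seq})
C2 = list(permutations(L1, 2))          # full linking: ordered pairs
L2 = [c for c in C2 if support(c, sequences) >= 0.30]
print(len(C2), len(L2))
```

With three users, a single supporting user already gives a support of one third, which exceeds the 30 % minimum; this is why every candidate contained in at least one user sequence is kept in L2.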


Table 1. Number of accesses and sequences in particular files.


File   Count of web accesses   Count of customer's sequences   Count of frequented sequences   Average size of customer's sequences
A1     70553                   12992                           71                              5
A2     70553                   12058                           81                              6
A3     70553                   11378                           89                              6
B1     75372                   12992                           73                              6
B2     75372                   12058                           82                              6
B3     75439                   11378                           93                              7

The analyzed course consisted of 12 activities and 145 course pages. Students' records of their activities on individual course pages in the learning management system were observed in the e-learning course in the winter term of 2010. We used logs stored in the relational database of LMS Moodle. LMS Moodle keeps detailed logs of all activities that students perform. It logs every click that students make for navigation in the e-learning course [5]. We used records from the mdl_log and mdl_log_display tables. These records contained the entities from the e-learning course with 180 participants. In this phase, the log file was cleaned of irrelevant items. First of all, we removed the entries of all users with a role other than student. After this task, 75 530 entries were accepted for use in the next task. These records were pre-processed in different ways. In each file, the variable Session identifies an individual course visit. The variable Session was based on the variables User ID, IP address and a timeout threshold of the selected length (15, 30 and 60-minute STT) in the case

Figure 2. Sequential/Stacked plot for derived rules in examined files.


Table 2. Incidence of discovered sequence rules in particular files.


Body          ==>   Head                                              A1     A2     A3     B1     B2     B3     Type of rule
course view   ==>   resource - final test requirements, course view   0      1      1      0      1      1      trivial
...           ==>   ...                                               ...    ...    ...    ...    ...    ...    ...
course view   ==>   view forum about ERD and relation schema          0      0      1      0      0      1      inexplicable
...           ==>   ...                                               ...    ...    ...    ...    ...    ...    ...
course view   ==>   view collaborative activities                     1      1      1      1      1      1      useful

Count of derived sequence rules                                       63     78     89     68     81     98
Per cent of derived sequence rules (Per cent 1's)                     62.4   77.2   88.1   67.3   80.2   97.0
Per cent 0's                                                          37.6   22.8   11.9   32.7   19.8   3.0

Cochran Q test: Q = 93.84758, df = 5, p < 0.001

of files X1, X2 and X3, where X = {A, B}. The paths were completed for each file BY separately, where Y = {1, 2, 3}, based on the sitemap of the course.

Compared to the file X1 with the identification of sessions based on a 15-minute STT (Table 1), the number of visits (customer's sequences) decreased by approximately 7 % in the case of the identification of sessions based on a 30-minute STT (X2) and by 12.5 % in the case of the identification of sessions based on a 60-minute STT (X3). In contrast, the number of frequented sequences increased by 14 % (A2) to 25 % (A3), and in the case of completed paths by 12 % (B2) to 27 % (B3). Having completed the paths (Table 1), the number of records increased by almost 7 % and the average length of visits/sequences increased from 5 to 6 (X2), and in the case of the identification of sessions based on a 60-minute STT even to 7 (X3).

We articulated the following assumptions:

1. we expect that the identification of sessions based on a shorter STT will have a significant impact on the quantity of extracted rules in terms of decreasing the portion of trivial and inexplicable rules,
2. we expect that the identification of sessions based on a shorter STT will have a significant impact on the quality of extracted rules in terms of their basic measures of quality,
3. we expect that the completion of paths will have a significant impact on the quantity of extracted rules in terms of increasing the portion of useful rules,
4. we expect that the completion of paths will have a significant impact on the quality of extracted rules in terms of their basic measures of quality.

4 EXPERIMENT RESULTS

We summarize the results of the experiment with regard to the last step of the used methodology in the next three subsections.


Table 3. Homogeneous groups for incidence of derived rules in examined files: (a) AY; (b) BY.
(a)
File   Incidence   1     2     3
A1     0.624       ***
A2     0.772             ***
A3     0.881                   ***
Kendall Coefficient of Concordance: 0.19459

(b)
File   Incidence   1     2     3
B1     0.673       ***
B2     0.802             ***
B3     0.970                   ***
Kendall Coefficient of Concordance: 0.19773

4.1 Portion of the Found Rules

The analysis (Table 2) resulted in sequence rules, which we obtained from frequented sequences fulfilling the minimum support (in our case min s = 0.02). Frequented sequences were obtained from identified sequences, i.e. visits of individual students during one term. There is a high coincidence between the results (Table 2) of the sequence rule analysis in terms of the portion of the found rules in the case of files with the identification of sessions based on a 30-minute STT with and without path completion (A2, B2). Most rules were extracted from the files with the identification of sessions based on a 60-minute STT: 89 were extracted from file A3, which represents over 88 %, and 98 were extracted from file B3, which represents over 97 % of the total number of found rules. Generally, more rules were found in the observed files with the completion of paths (BY).

Based on the results of the Q test (Table 2), the null hypothesis, which states that the incidence of rules does not depend on the individual levels of data preparation for web log mining, is rejected at the 1 % significance level.

Kendall's coefficient of concordance represents the degree of concordance in the number of found rules among the examined files. The value of the coefficient (Table 3) is approximately 0.19 in both groups (AY, BY), where 1 means perfect concordance and 0 represents discordance. The low values of the coefficient confirm the Q test results.

The multiple comparisons (Tukey HSD test) did not identify any homogeneous group (Table 3) in terms of the average incidence of the found rules. Statistically significant differences at the 0.05 significance level were proved in the average incidence of found rules among all examined files (X1, X2, and X3).
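As an illustration of how a Cochran Q statistic of the kind reported in Table 2 is obtained from a rule-by-file incidence matrix, the following Python sketch applies the standard formula Q = (k - 1)(k ΣC² - N²) / (kN - ΣR²), where C are column totals, R are row totals, N is the grand total and k is the number of files. The incidence matrix below is invented for illustration, because the full per-rule incidence matrix is not reproduced in the paper; the function name `cochran_q` is also an assumption.

```python
def cochran_q(table):
    """Cochran's Q for a list of rows of 0/1 values (one column per file).

    Returns (Q, degrees of freedom). Q is compared with the chi-square
    distribution with k - 1 degrees of freedom. The statistic is
    undefined when every row is all zeros or all ones.
    """
    k = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    return numerator / denominator, k - 1


# Hypothetical incidence of six rules in three files (not the paper's data).
toy_incidence = [
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
]
q_value, q_df = cochran_q(toy_incidence)
print(q_value, q_df)
```

For the six files of the experiment the same function would be called with one row per rule and six columns, giving five degrees of freedom as in Table 2.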

Table 4. Crosstabulations AY x BY: A1 x B1.


A1\B1    0             1             Total
0        33 (32.67%)   5 (4.95%)     38 (37.62%)
1        0 (0.00%)     63 (62.38%)   63 (62.38%)
Total    33 (32.67%)   68 (67.33%)   101 (100%)

McNemar (B/C): Chi2 = 3.2, df = 1, p = 0.0736


Table 5. Crosstabulations AY x BY: A2 x B2.


A2\B2    0             1             Total
0        19 (18.81%)   4 (3.96%)     23 (22.77%)
1        1 (0.99%)     77 (76.24%)   78 (77.23%)
Total    20 (19.80%)   81 (80.20%)   101 (100%)

McNemar (B/C): Chi2 = 0.8, df = 1, p = 0.3711

The value of STT has an important impact on the quantity of extracted rules (X1, X2, and X3) in the process of session identification based on time.

If we look at the results in detail (Tables 4-6), we can see that the files with completed paths (BY) contained the same rules as the files without completed paths (AY), except for one rule in the case of the files with a 30-minute STT (X2) and three rules in the case of the files with a 60-minute STT (X3). The difference consisted only of 4 to 12 new rules, which were found in the files with completed paths (BY). In the case of the files with 15 and 30-minute STT (B1, B2), the portion of new rules represented 5 % and 4 %. In the case of the file with a 60-minute STT (B3) it was almost 12 %, where a statistically significant difference (Table 6) in the number of found rules between A3 and B3 in favor of B3 was also proved.

The completion of the paths has an impact on the quantity of extracted rules only in the case of files with the identification of sessions based on a 60-minute timeout (A3 vs. B3). In contrast, making provisions for the completion of paths in the case of files with the identification of sessions based on shorter timeouts has no significant impact on the quantity of extracted rules (X1, X2).

4.2 Portion of Inexplicable Rules

Now we look at the results of the sequence analysis more closely, taking into consideration the portion of each kind of discovered rule. We require association rules to be not only clear but also useful.

Table 6. Crosstabulations AY x BY: A3 x B3.


A3\B3    0            1             Total
0        0 (0.00%)    12 (11.88%)   12 (11.88%)
1        3 (2.97%)    86 (85.15%)   89 (88.12%)
Total    3 (2.97%)    98 (97.03%)   101 (100%)

McNemar (B/C): Chi2 = 4.3, df = 1, p = 0.0389
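The McNemar statistics in Tables 4-6 depend only on the two discordant cells of each crosstabulation, i.e. the rules found in one file of a pair but not in the other. The reported values are consistent with the continuity-corrected form (|b - c| - 1)² / (b + c), which the following Python sketch evaluates. The function name `mcnemar_corrected` is hypothetical, and the use of the continuity correction is an inference from the reported numbers rather than something stated in the text.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar test for discordant counts b and c.

    Returns (chi2, p), where p is the upper tail of the chi-square
    distribution with 1 degree of freedom, computed as erfc(sqrt(chi2/2)).
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Discordant cells taken from Tables 4-6:
# (rules only in the BY file, rules only in the AY file).
discordant = {"A1 x B1": (5, 0), "A2 x B2": (4, 1), "A3 x B3": (12, 3)}
mcnemar_results = {
    pair: mcnemar_corrected(b, c) for pair, (b, c) in discordant.items()
}
for pair, (chi2, p_value) in mcnemar_results.items():
    print(f"{pair}: Chi2 = {chi2:.1f}, p = {p_value:.4f}")
```

Only the pair A3 x B3 falls below the 0.05 level, in line with the conclusion that path completion affects the number of rules only for the 60-minute STT.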


Table 7. Crosstabulations - Incidence of rules x Types of rules: A1.


A1\Type   useful        trivial       inexplicable
0         2 (9.52%)     32 (42.67%)   4 (80.00%)
1         19 (90.48%)   43 (57.33%)   1 (20.00%)
Total     21 (100%)     75 (100%)     5 (100%)

Pearson: Chi2 = 11.7, df = 2, p = 0.0029
Con. Coef. C: 0.32226
Cramér's V: 0.34042

Association analysis produces three common types of rules [37]: the useful (utilizable, beneficial), the trivial, and the inexplicable.

In our case, we differentiate the same types for sequence rules. The only requirement (validity assumption) for the use of the chi-square test is sufficiently high expected frequencies [38]. The condition is violated if the expected frequencies are lower than 5. The validity assumption of the chi-square test is violated in our tests. For this reason, we rely not only on the results of the Pearson chi-square test, but also on the value of the calculated contingency coefficient.

Contingency coefficients (Coef. C, Cramér's V) represent the degree of dependency between two nominal variables. The value of the coefficient (Table 7) is approximately 0.34. There is a medium dependency between the portion of useful, trivial and inexplicable rules and their occurrence in the set of discovered rules extracted from the data matrix A1, and the contingency coefficient is statistically significant. The null hypothesis (Table 7) is rejected at the 1 % significance level, i.e. the portion of useful, trivial and inexplicable rules depends on the identification of sessions based on a 15-minute STT. This file contained the

Table 8. Crosstabulations - Incidence of rules x Types of rules: A2.


A2\Type   useful        trivial       inexp.
0         1 (4.76%)     19 (25.33%)   3 (60.00%)
1         20 (95.24%)   56 (74.67%)   2 (40.00%)
Total     21 (100%)     75 (100%)     5 (100%)

Pearson: Chi2 = 8.1, df = 2, p = 0.0175
Con. Coef. C: 0.27237
Cramér's V: 0.28308


Table 9. Crosstabulations - Incidence of rules x Types of rules: A3.


A3\Type   useful         trivial       inexp.
0         0 (0.00%)      11 (14.67%)   1 (20.00%)
1         21 (100.00%)   64 (85.33%)   4 (80.00%)
Total     21 (100%)      75 (100%)     5 (100%)

Pearson: Chi2 = 3.7, df = 2, p = 0.1571
Con. Coef. C: 0.18804
Cramér's V: 0.19145

fewest trivial and inexplicable rules, while 19 useful rules were extracted from the file (A1), which represents over 90 % of the total number of useful rules found.

The value of the coefficient (Table 8) is approximately 0.28, where 1 means a perfect relationship and 0 no relationship. There is a weak dependency between the portion of useful, trivial and inexplicable rules and their occurrence in the set of discovered rules extracted from the data matrix of File A2, and the contingency coefficient is statistically significant. The null hypothesis (Table 8) is rejected at the 5 % significance level, i.e. the portion of useful, trivial and inexplicable rules depends on the identification of sessions based on a 30-minute timeout.

The coefficient value (Table 9) is approximately 0.19, where 1 represents perfect dependency and 0 means independence. There is a weak dependency between the portion of useful, trivial and inexplicable rules and their occurrence in the set of discovered rules extracted from the data matrix of File A3, and the contingency coefficient is not statistically significant. This file contained the most trivial and inexplicable rules, while the portion of useful rules did not increase significantly.
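The Pearson chi-square statistic and the two contingency coefficients reported in Tables 7-9 can be recomputed from the observed counts alone. The following Python sketch does this for Table 7 (file A1), using the usual definitions C = sqrt(Chi2 / (Chi2 + n)) and V = sqrt(Chi2 / (n (min(r, c) - 1))). The function name `chi_square_measures` is hypothetical and the sketch is not the software procedure used in the experiment.

```python
import math


def chi_square_measures(table):
    """Pearson chi-square, contingency coefficient C and Cramér's V.

    table: list of rows with observed counts (here incidence 0/1 by
    rule type). Also returns the smallest expected frequency, which is
    relevant for the validity assumption of the chi-square test.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            chi2 += (observed - expected) ** 2 / expected
    coef_c = math.sqrt(chi2 / (chi2 + n))
    smaller_dim = min(len(row_totals), len(col_totals))
    cramers_v = math.sqrt(chi2 / (n * (smaller_dim - 1)))
    return chi2, coef_c, cramers_v, min_expected


# Observed counts from Table 7: columns are useful, trivial, inexplicable.
table_a1 = [
    [2, 32, 4],    # rule not found in A1 (0)
    [19, 43, 1],   # rule found in A1 (1)
]
chi2_a1, c_a1, v_a1, min_exp_a1 = chi_square_measures(table_a1)
# With df = 2 the chi-square upper tail has the closed form exp(-chi2 / 2).
p_a1 = math.exp(-chi2_a1 / 2)
print(round(chi2_a1, 1), round(c_a1, 5), round(v_a1, 5), round(p_a1, 4))
```

The smallest expected frequency for Table 7 is below 5, which illustrates why the validity assumption of the chi-square test is considered violated and why the contingency coefficients are reported alongside the test.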


Almost identical results were achieved for files with completion of the paths, too (Table 10). Similarly, the portion of useful, trivial and inexplicable rules is also approximately equal in case of files A1, B1 and files A2, B2. It corresponds with results from previous chapter (chapter 4.1), where were not proved significant differences in number of the discovered rules between files A1, B1 and files A2, B2. On the contrary, there was statistically significant difference (Table 6) between A3 and B3 in favor of B3. If we have a look at the differences between A3 and B3 in dependency on types of rule (Table 9, Table 10c), we observe increase in number of trivial and inexplicable rules in case B3, while the portion of useful rules is equal in both files. The portion of trivial and inexplicable rules is dependent from the length of timeout by the identification of sessions based on time and independent from reconstruction of student`s activities in case of the identification of sessions based on 15-minute and 30-minute STT. Completion of paths has not impact on increasing portion of useful rules. On the

contrary, an inappropriately chosen timeout may increase the number of trivial and inexplicable rules.

4.3 The Values of Support and Confidence Rates of the Found Rules

The quality of sequence rules is assessed by means of two indicators [37]: support and confidence. The results of the sequence rule analysis showed differences not only in the quantity of the found rules but also in their quality. Kendall's coefficient of concordance represents the degree of concordance in the support of the found rules among the examined files. The value of the coefficient (Table 11a) is approximately 0.89, where 1 means perfect concordance and 0 represents discordance. From the multiple comparison (Tukey HSD test), five homogeneous groups (Table 11a) of examined files were identified in terms of the average support of the found rules. The first homogeneous group consists of files A1, B1, the third of files A2, B2 and the fifth

Table 10. Crosstabulations - Incidence of rules x Types of rules: (a) B1; (b) B2; (c) B3. (U - useful, T - trivial, I - inexplicable rules; C - contingency coefficient, V - Cramér's V.)

(a)
B1 \ Type   U             T             I
0           2 (9.5%)      27 (36.0%)    4 (80.0%)
1           19 (90.5%)    48 (64.0%)    1 (20.0%)
Total       21 (100%)     75 (100%)     5 (100%)
Pearson     Chi2 = 10.6, df = 2, p = 0.0050
C           0.30798
V           0.32372

(b)
B2 \ Type   U             T             I
0           2 (9.5%)      15 (20.0%)    3 (60.0%)
1           19 (90.5%)    60 (80.0%)    2 (40.0%)
Total       21 (100%)     75 (100%)     5 (100%)
Pearson     Chi2 = 6.5, df = 2, p = 0.0390
C           0.24565
V           0.25342

(c)
B3 \ Type   U             T             I
0           0 (0.0%)      3 (4.0%)      0 (0.0%)
1           21 (100.0%)   72 (96.0%)    5 (100.0%)
Total       21 (100%)     75 (100%)     5 (100%)
Pearson     Chi2 = 1.1, df = 2, p = 0.5851
C           0.10247
V           0.10302


Table 11. Homogeneous groups for (a) support of derived rules; (b) confidence of derived rules.

(a)
File   Support   1      2      3      4      5
A1     4.330     ****
B1     4.625     ****   ****
A2     4.806            ****   ****
B2     5.104                   ****   ****
A3     5.231                          ****   ****
B3     5.529                                 ****
Kendall Coefficient of Concordance: 0.88778

(b)
File   Confidence   1      2      3      4      5
A1     26.702       ****
B1     27.474       ****   ****
A2     27.762              ****   ****
B2     28.468                     ****   ****
A3     28.833                            ****   ****
B3     29.489                                   ****
Kendall Coefficient of Concordance: 0.78087

of files A3, B3. There is no statistically significant difference in the support of discovered rules between the files within each of these groups. On the contrary, statistically significant differences at the 0.05 significance level in the average support of found rules were proved among files A1, A2, A3 and among files B1, B2, B3. Differences in quality in terms of the confidence values of the discovered rules were also demonstrated among the individual files. The value of the coefficient of concordance (Table 11b) is almost 0.78, where 1 means perfect concordance and 0 represents discordance. From the multiple comparison (Tukey HSD test), five homogeneous groups (Table 11b) of examined files were identified in terms of the average confidence of the found rules. The first homogeneous group consists of files A1, B1, the third of files A2, B2 and the fifth of files A3, B3. There is no statistically significant difference in the confidence of discovered rules between the files within each of these groups.
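To make the coefficient of concordance concrete, the sketch below computes Kendall's W from a small matrix of support values. It assumes that each rule acts as a "judge" that ranks the six files by the support it obtained there; this reading of the design, the function names and the numbers in `support` are illustrative assumptions rather than data from the experiment, and the correction for ties is omitted for brevity.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kendall_w(scores):
    """Kendall's W for m judges (rows) scoring n objects (columns), no tie correction."""
    m, n = len(scores), len(scores[0])
    rank_rows = [average_ranks(row) for row in scores]
    rank_sums = [sum(col) for col in zip(*rank_rows)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical support values (%) of three rules in files A1, B1, A2, B2, A3, B3.
support = [
    [4.1, 4.4, 4.7, 5.0, 5.2, 5.5],
    [2.0, 2.1, 2.5, 2.4, 2.9, 3.0],
    [7.3, 7.5, 7.4, 7.9, 8.1, 8.0],
]
w = kendall_w(support)
print(f"Kendall's W = {w:.3f}")
```

A value close to 1 indicates that the rules order the files in nearly the same way, while a value close to 0 indicates that the orderings are unrelated.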

On the contrary, statistically significant differences at the 0.05 significance level in the average confidence of found rules were proved among files A1, A2, A3 and among files B1, B2, B3. The results (Table 11a, Table 11b) show that the largest degree of concordance in support and confidence is between the rules found in a file without path completion (AY) and in the corresponding file with path completion (BY). On the contrary, there is discordance among files with different timeouts (X1, X2, X3) in both groups (AY, BY). The timeout used for time-based session identification has a substantial impact on the quality of extracted rules (X1, X2, X3). On the contrary, path completion has no significant impact on the quality of extracted rules (AY, BY).

5 CONCLUSIONS

The first assumption concerning the identification of sessions based on time and its impact on the quantity of extracted


rules was fully proved. Specifically, it was proved that the length of the STT has an important impact on the quantity of extracted rules. Statistically significant differences in the average incidence of found rules were proved among files A1, A2, A3 and among files B1, B2, B3. The portion of trivial and inexplicable rules depends on the STT. Identifying sessions with a shorter STT decreases the portion of trivial and inexplicable rules. The second assumption, concerning time-based session identification and its impact on the quality of extracted rules in terms of their basic quality measures, was also fully proved. Similarly, it was proved that a shorter STT has a significant impact on the quality of extracted rules. Statistically significant differences in the average support and confidence of found rules were proved among files A1, A2, A3 and among files B1, B2, B3. On the contrary, it was shown that path completion has no significant impact on either the quantity or the quality of extracted rules (AY, BY). Path completion does not increase the portion of useful rules. It affects the quantity of extracted rules only for files with sessions identified using the 60-minute STT (A3 vs. B3), where the portion of trivial and inexplicable rules increased. Path completion combined with an inappropriately chosen STT may increase the number of trivial and inexplicable rules. The results show that the largest degree of concordance in support and confidence is between the rules found in a file without path completion (AY) and in the corresponding file with path completion (BY). The third and fourth assumptions were not proved.
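The effect of the STT on the number of identified sessions can be illustrated with a minimal Python sketch of time-based session identification. The record layout `(user, timestamp, activity)`, the function name `identify_sessions` and the sample log are hypothetical, and the sketch assumes that a new session starts when the gap between two consecutive accesses of the same user strictly exceeds the threshold; the pre-processing used in the experiment may differ in these details.

```python
from datetime import datetime, timedelta

def identify_sessions(records, timeout_minutes):
    """Split (user, timestamp, activity) records into per-user sessions.

    A new session starts whenever the gap between two consecutive
    accesses of the same user exceeds the session timeout threshold.
    """
    timeout = timedelta(minutes=timeout_minutes)
    sessions = []    # list of (user, [activities]) in order of creation
    last_seen = {}   # user -> timestamp of the previous access
    current = {}     # user -> activity list of the session being built
    for user, ts, activity in sorted(records, key=lambda r: (r[0], r[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            current[user] = []
            sessions.append((user, current[user]))
        current[user].append(activity)
        last_seen[user] = ts
    return sessions

# Hypothetical log: gaps of 10, 20 and 45 minutes for student s1.
start = datetime(2010, 3, 1, 9, 0)
log = [
    ("s1", start, "course view"),
    ("s1", start + timedelta(minutes=10), "resource view"),
    ("s1", start + timedelta(minutes=30), "quiz attempt"),
    ("s1", start + timedelta(minutes=75), "forum view"),
    ("s2", start + timedelta(minutes=5), "course view"),
]

for stt in (15, 30, 60):
    print(stt, len(identify_sessions(log, stt)))
```

With a shorter threshold the same log is cut into more and shorter sessions, which changes the sequences from which rules are later extracted.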

It follows from the above that the statement of several researchers that the number of identified sessions depends on time was confirmed. The results of the experiment showed that this dependency is not simple. A wrong choice of STT could lead to an increase in trivial and especially inexplicable rules. Our research indicates that it is possible to reduce the complexity of the pre-processing phase when web usage mining methods are used in an educational context. We suppose that if the structure of an e-learning course is relatively rigid and the LMS provides sophisticated navigation options, the task of path completion can be removed from the pre-processing phase of web data mining because it has no significant impact on the quantity and quality of extracted knowledge. We also suppose that the results of our experiment were markedly influenced by the use of a modern form of navigation in the e-learning course. These are called breadcrumbs, and they are available in each e-learning course of the LMS Moodle, which was used in our experiment. We assume that breadcrumbs eliminate the use of the back button of the web browser. We further assume that the path reconstruction [...]

The experiment has several weaknesses. First, we have to note that the experiment was carried out on data obtained from one e-learning course. Therefore, the obtained results could be distorted by the course structure and the teaching methods used. To generalize the findings, it would be necessary to repeat the proposed experiment on data obtained from several e-learning courses with various structures and/or various


uses of learning activities supporting the course. We would like to concentrate on further comprehensive work on generalizing the presented methodology and increasing the reliability of the data used in the experiment. We plan to repeat and improve the proposed methodology to accumulate evidence in the future. Furthermore, we intend to investigate ways of integrating the path completion mechanism used in our experiment into contemporary LMSs, or eventually into standardized web servers.

6 REFERENCES

1. Ba-Omar, H., Petrounias, I., Anwar, F.: A Framework for Using Web Usage Mining to Personalise E-learning. Advanced Learning Technologies, 2007. ICALT 2007. Seventh IEEE International Conference on (2007) 937-938
2. Crespo Garcia, R.M., Kloos, C.D.: Web Usage Mining in a Blended Learning Context: A Case Study. Advanced Learning Technologies, 2008. ICALT '08. Eighth IEEE International Conference on (2008) 982-984
3. Chitraa, V., Davamani, A.S.: A Survey on Preprocessing Methods for Web Usage Data. International Journal of Computer Science and Information Security 7 (2010)
4. Marquardt, C.G., Becker, K., Ruiz, D.D.: A Pre-processing Tool for Web Usage Mining in the Distance Education Domain. Database Engineering and Applications Symposium, 2004. IDEAS '04. Proceedings. International (2004) 78-87
5. Romero, C., Ventura, S., Garcia, E.: Data Mining in Course Management Systems: Moodle Case Study and Tutorial. Comput. Educ. 51 (2008) 368-384
6. Falakmasir, M.H., Habibi, J.: Using Educational Data Mining Methods to Study the Impact of Virtual Classroom in E-Learning. In: Baker, R.S.J.d., Merceron, A., Pavlik, P.I.J. (eds.): 3rd International Conference on Educational Data Mining, Pittsburgh (2010) 241-248
7. Cápay, M., Balogh, Z., Boledovičová, M., Mesárošová, M.: Interpretation of Questionnaire Survey Results in Comparison with Usage Analysis in E-learning System for Healthcare. DICTAP 2011, Communications in Computer and Information Science 167 CCIS (PART 2), pp. 504-516
8. Liu, B.: Web Data Mining. Exploring Hyperlinks, Contents and Usage Data. Springer (2006)
9. Munk, M., Kapusta, J., Švec, P.: Data Pre-processing Evaluation for Web Log Mining: Reconstruction of Activities of a Web Visitor. Procedia Computer Science 1 (2010) 2273-2280
10. Romero, C., Espejo, P.G., Zafra, A., Romero, J.R., Ventura, S.: Web Usage Mining for Predicting Final Marks of Students that Use Moodle Courses. Computer Applications in Engineering Education (2010) 26
11. Raju, G.T., Satyanarayana, P.S.: Knowledge Discovery from Web Usage Data: a Complete Preprocessing Methodology. IJCSNS International Journal of Computer Science and Network Security 8 (2008)
12. Spiliopoulou, M., Mobasher, B., Berendt, B., Nakagawa, M.: A Framework for the Evaluation of Session Reconstruction Heuristics in Web-Usage Analysis. INFORMS J. on Computing 15 (2003) 171-190
13. Bayir, M.A., Toroslu, I.H., Cosar, A.: A New Approach for Reactive Web Usage Data Processing. Data Engineering Workshops, 2006. Proceedings. 22nd International Conference on (2006) 44-44
14. Zhang, H., Liang, W.: An Intelligent Algorithm of Data Pre-processing in Web Usage Mining. Proceedings of the World Congress on Intelligent Control and Automation (WCICA) (2004) 3119-3123
15. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems 1 (1999) 5-32
16. Yan, L., Boqin, F., Qinjiao, M.: Research on Path Completion Technique in Web Usage Mining. Computer Science and Computational Technology, 2008. ISCSCT '08. International Symposium on, Vol. 1 (2008) 554-559
17. Yan, L., Boqin, F.: The Construction of Transactions for Web Usage Mining. Computational Intelligence and Natural Computing, 2009. CINC '09. International Conference on, Vol. 1 (2009) 121-124
18. Huynh, T.: Empirically Driven Investigation of Dependability and Security Issues in Internet-Centric Systems. Department of Electrical and Computer Engineering. University of Alberta, Edmonton (2010)
19. Huynh, T., Miller, J.: Empirical Observations on the Session Timeout Threshold. Inf. Process. Manage. 45 (2009) 513-528
20. Catledge, L.D., Pitkow, J.E.: Characterizing Browsing Strategies in the World-Wide Web. Comput. Netw. ISDN Syst. 27 (1995) 1065-1073
21. Huntington, P., Nicholas, D., Jamali, H.R.: Website Usage Metrics: A Reassessment of Session Data. Inf. Process. Manage. 44 (2008) 358-372
22. Meiss, M., Duncan, J., Goncalves, B., Ramasco, J.J., Menczer, F.: What's in a Session: Tracking Individual Behavior on the Web. Proceedings of the 20th ACM conference on Hypertext and hypermedia. ACM, Torino, Italy (2009)
23. Huang, X., Peng, F., An, A., Schuurmans, D.: Dynamic Web Log Session Identification with Statistical Language Models. J. Am. Soc. Inf. Sci. Technol. 55 (2004) 1290-1303
24. Goseva-Popstojanova, K., Mazimdar, S., Singh, A.D.: Empirical Study of Session-Based Workload and Reliability for Web Servers. Proceedings of the 15th International Symposium on Software Reliability Engineering. IEEE Computer Society (2004)
25. Tian, J., Rudraraju, S., Zhao, L.: Evaluating Web Software Reliability Based on Workload and Failure Data Extracted from Server Logs. Software Engineering, IEEE Transactions on 30 (2004) 754-769
26. Chen, Z., Fowler, R.H., Fu, A.W.-C.: Linear Time Algorithms for Finding Maximal Forward References. Proceedings of the International Conference on Information Technology: Computers and Communications. IEEE Computer Society (2003)
27. Borbinha, J., Baker, T., Mahoui, M., Jo Cunningham, S.: A Comparative Transaction Log Analysis of Two Computing Collections. Research and Advanced Technology for Digital Libraries, Vol. 1923. Springer Berlin / Heidelberg (2000) 418-423
28. Kohavi, R., Mason, L., Parekh, R., Zheng, Z.: Lessons and Challenges from Mining Retail E-Commerce Data. Mach. Learn. 57 (2004) 83-113
29. Munk, M., Kapusta, J., Švec, P., Turčáni, M.: Data Advance Preparation Factors Affecting Results of Sequence Rule Analysis in Web Log Mining. E+M Economics and Management 13 (2010) 143-160
30. Wang, T., He, P.: Web Log Mining by an Improved AprioriAll Algorithm. Engineering and Technology, 2005, No. 4, pp. 97-100
31. Agrawal, R., Imieliński, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. Proceedings of the 1993 ACM SIGMOD international conference on Management of data. ACM, Washington, D.C., United States (1993)
32. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of the 20th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc. (1994)
33. Han, J., Lakshmanan, L.V.S., Pei, J.: Scalable Frequent-pattern Mining Methods: an Overview. Tutorial notes of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, San Francisco, California (2001)
34. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, New York (2000)
35. Electronic Statistics Textbook. StatSoft, Tulsa (2010)
36. Romero, C., Ventura, S.: Educational Data Mining: A Survey from 1995 to 2005. Expert Systems with Applications 33 (2007) 135-146
37. Berry, M.J., Linoff, G.S.: Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley Publishing, Inc. (2004)
38. Hays, W.L.: Statistics. CBS College Publishing, New York (1988)