Abstract
In recent times, the urge to collect data and analyze it has grown. Time stamping a data set is an important part of the analysis and data mining as it can give information that is more useful. Different mining techniques have been designed for mining time-series data, sequential patterns for example seeks relationships between occurrences of sequential events and finds if there exist any specific order of the occurrences. Many Algorithms has been proposed to study this data type based on the apriori approach. In this paper we compare two basic sequential algorithms which are General Sequential algorithm (GSP) and Sequential PAttern Discovery using Equivalence classes (SPADE). These two algorithms are based on the Apriori algorithms. Experimental results have shown that SPADE consumes less time than GSP algorithm.
Author Contributions
Copyright© 2021
KRIBII Rajae, et al.
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Competing interests The authors declare that they have no conflict of interest.
Funding Interests:
Citation:
Introduction
Data mining is a critical step in the field of data science. It involves sorting large data sets to identify patterns and build relationships to solve problems through data analysis. Data mining tools help companies predict future trends. Specifically, data mining benefits vary depending on the goal and the industry. Sales and marketing departments can mine customer data to improve lead conversion rates or to create one-to-one marketing campaigns. Data mining information on historical sales patterns and customer behaviors can be used to build prediction models for future sales, new products and services. Data mining parameters include classification, sequence or path analysis, clustering and forecasting. Sequence or path parsing parameters look for patterns in which one event leads to another subsequent event. A sequence is an ordered list of element sets and is a common type of data structure in many databases. A classification parameter searches for new models and may cause a change in the organization of the data. Classification algorithms predict variables based on other factors in the database. In data mining, association rules are created by analyzing the data to look for frequent "if / then" patterns, and then using the Sequence Mining was first introduced in 1995 by Argwal and Srikant Various algorithms have been implemented to identify the frequent sequences from the sequence database. One of the approaches used to identify frequent sequences is the Apriori approach This paper is organized as follows: in the second section, we describe the Generalized Sequential Models (GSP) algorithm, which use a horizontal format database. In the third section a Sequential Pattern Discovery using Equivalence classes (SPADE) which uses a vertical database is given. A case study is given in section four. The paper is ended by experimental results and conclusion. The Generalized Sequential Models (GSP) algorithm is an apriori-based algorithm applied to sequential models. It integrates with time constraints and relaxes the definition of transaction. For time constraints, maximum gap and minimal gap are defined to specify the gap between any two adjacent transactions in the sequence. If the distance is not in the range between maximum gap and minimal gap then this two cannot be taken as two consecutive transactions in a sequence. The transactions are then applied to generate multiple level sequential patterns to find all sequences whose support is greater than the user-defined minimum support GSP uses the Apriori property F1 = the set of frequent 1-sequence K = 2, do while Fk-1 != Null; Generate candidate sets Ck (set of candidate k-sequences); For all input sequences s in the database D do Increment count of all a in Ck if s supports a End do Fk = {a ∈ Ck such that its frequency exceeds the threshold} k = k+1; End do Result = Set of all frequent sequences is the union of all Fk' s SPADE is an algorithm proposed to find frequent sequences using efficient lattice search techniques and simple joins. SPADE decomposes the original problem into smaller sub-problems using equivalence classes on frequent sequences so that each class can be solved independently. SPADE usually makes only three database scans, first one for frequent 1-sequences, second for frequent 2-sequences, and one more for generating all other frequent sequences. If the support of 2-sequences is available then only one scan is required SPADE outperforms GSP algorithm, by a factor of two, and by an order of magnitude with precomputed support of 2-sequences. It also has excellent scale up properties with respect to a number of parameters such as the number of input-sequences, the number of events per input-sequence, the event size, and the size of potential maximal frequent events and sequences SPADE (D,min_ supp): F1={Frequent items or 1-sequences } F2={ Frequent 2 - sequences} ℇ={equivalence classes Xθ1} for all X∈ ℇ do EnumerateFrequentSeq(X) EnumerateFrequentSeq(S) : for all atoms Ai ∈ S do Ti≠ Ø for all atoms Aj ∈ S with j≥i do R=Ai V Aj if ( Prune(R)==false) then L(R)= L(Ai ) ⋂ L(Aj) if σ(R)≥min_supp then Ti=Ti U {R}; F|R| = F|R| U {R} end if (DepthFirstSearch) then EnumerateFrequentSeq(T_i ) end if(BreadthFirstSearch) then for all Ti≠Ø do EnumerateFrequentSeq(Ti)
To demonstrate an explanation of the algorithms GSP and SPADE an example is given ( The Candidate sets are denoted by C. Ck specifies candidate sets having K items. The Candidate sets that satisfy minimum support belongs to Lk . K denotes number of items in sequence. As a first step, the database needs to be grouped by C1={ A,B,C,D,E,F,G,H} are the initial candidates for the first round. For each item, we should extract the number of occurrences or the support that would eliminate the less frequent items ( The items that satisfies the minimum support 2 are L1={A,B,D,F} C2 can be obtained by joining L1×L1. C2={A→A, A→B, A→D, A→F, AB, AD, AF,B→A, B→B, B→D, B→F, BD, BF, D→A, D→B, D→D, D→F, DF, F→A, F→B, F→D, F→F} The items that satisfies the minimal support: L2={AB, AF, B→A, BF, D→A, D→B, D→F, F→A} In this step, C3 can be obtained by joining L2 X L2
L3={ABF, BF→A, D→B→A, D→F→A, D→BF} In this step C4 can be obtained by joining L3 X L3 L4={D→BF→A} Most of the sequential pattern mining algorithms assume horizontal database layout. SPADE uses vertical database format. We can generate a new mapping model. Then to generate the other sequence elements of size 2, we perform a join between the tables and compare the number of occurrences with the minimal support. Joining A and B items makes two kind of elements, (AB) is the equality join which means they happen at the same time; thus, we’ll notice the same The support for (AB) is 3, having two cases from the same sequence makes a redundancy problem in the model. For A → B, the support is 1. The support for (AD) is 1, it does not satisfy the minimum support. The support for (AF) is 3, it satisfies the minimum support, eve, if the number of occurrences is 4 in the database, there is two elements within the same sequence, so they will be considered as one. The number of occurrences is 0, it does not satisfy the minimum support. The number of occurrences is 4, it satisfies the minimum support. The number of occurrences is 1, it does not satisfy the minimum support. The number of occurrences is 2, both belongs to the same sequence, that makes the support 1, so it does not satisfy the min support. The support is 1, because the occurrences are from the same sequence. Therefore, it does not satisfy the min support. The support is 1. In addition, it does not satisfy the min support. The number of occurrences is 2, it satisfies the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 2, it satisfies the minimum support. The support is 2, it satisfies the minimum support. Tab 25. The support is 2, it satisfies the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 2, it satisfies the minimum support. From equality join and temporal joins, the table showing count of occurrence of sequences of items. L2={(AB),(AF),(BF),B→A,D→A,D→F,D→B,F→A} Generating 3-sequence: The support is 3, it satisfies the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 2, it satisfies the minimum support. The support is 2, it satisfies the minimum support. The support is 2, it satisfies the minimum support. The support is 2, it satisfies the minimum support. The support is 0, it does not satisfy the minimum support. The support is 2, it satisfies the minimum support. The support is 2, it satisfies the minimum support. The support is 0, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. The support is 1, it does not satisfy the minimum support. So, the results are: L3={(ABF), (BF)→A, D→B→A, D→F→A, D→(BF)} Generate 4-sequence: The support is 1, it does not satisfy the minimum support. The support is 2, it satisfies the minimum support. The result: L4={D→(BF)→A}
Client ID
Date
Items
1
5/1/2015
C, D
1
16/1/2016
A, B, C
1
3/1/2017
A, B, F
1
3/1/2017
A, C, D, F
2
8/1/2017
A, B, F
2
10/1/2016
E
3
5/1/2016
A, B, F
4
5/1/2016
D, H, G
4
3/1/2017
B, F
4
8/1/2017
A, G, H
SID
EID
ITEMS
1
1,2,3,4
<(CD)(ABC)(ABF)(ACDF)>
2
1,2
<(ABC)(E)>
3
1
<(ABF)>
4
1,2,3
<(DHG)(BF)(AGH)>
Items
N° d’occurrence
A
4
B
4
C
1
D
2
E
1
F
4
G
1
H
1
Items
N° d’occurrence
A →A
1
A→B
1
A→D
1
A→F
1
AB
3
AD
1
AF
3
B→A
2
B→B
1
B→D
1
B→F
1
BD
1
BF
4
D→A
2
D→B
2
D→D
2
D→F
2
DF
1
F→A
2
F→B
1
F→D
1
F→F
1
ABF
3
AB→A
1
AF →A
1
BF →A
2
D →B →A
2
D →F→ A
2
D →BF
2
F →AF
1
D →FA
1
D→AB
1
F→AB
0
B→AB
1
B→AF
1
Items
N° occurrence
D→BF→A
2
A
B
D
F
SID
EID
SID
EID
SID
EID
SID
EID
1
2
1
2
1
1
1
3
1
3
1
3
1
1
1
4
1
4
2
1
4
1
2
1
2
1
3
1
3
1
3
1
4
2
4
2
4
3
(AB)
SID
EID A
EID B
1
2
2
1
3
3
2
1
1
3
1
1
A→B
SID
EID A
EID B
1
3
4
(AD)
SID
EID A
EID D
1
4
4
(AF)
SID
EID A
EID F
1
3
3
1
4
4
2
2
2
3
1
1
(BD)
SID
EID B
EID D
-
-
-
(BF)
SID
EID B
EID F
1
3
3
2
2
2
3
1
1
4
3
3
(DF)
SID
EID D
EID F
1
4
4
A→A
SID
EID A
EID A
1
2
3
1
3
4
A→D
SID
EID A
EID D
1
2
4
1
3
4
A→F
SID
EID A
EID F
1
2
3
1
2
4
1
3
4
B→A
SID
EID B
EID A
1
2
3
4
3
4
B→B
SID
EID B
EID B
1
2
2
1
3
3
B→D
SID
EID B
EID D
1
2
4
1
3
4
B→F
SID
EID B
EID F
1
3
4
D→D
SID
EID D
EID D
1
1
4
D→A
SID
EID D
EID A
1
1
3
1
1
4
4
1
4
D→B
SID
EID D
EID B
1
1
2
1
1
3
4
1
3
D→F
SID
EID D
EID F
1
1
3
1
1
4
4
1
3
F→F
SID
EID F
EID F
1
3
4
F→D
SID
EID F
EID D
1
3
4
F→B
SID
EID F
EID B
1
3
4
F→A
SID
EID F
EID A
1
3
4
4
3
4
Items
N° occurrence
AB
3
AD
1
AF
3
BD
0
BF
4
DF
1
A→A
1
A→B
1
A→D
1
A→F
1
B→A
2
B→B
1
B→D
1
B→F
1
D→D
1
D→A
2
D→B
2
D→F
2
F→F
1
F→D
1
F→B
1
F→A
1
ABF
SID
EID A
EID B
EID F
1
3
3
3
2
2
2
2
3
1
1
1
AB→A
SID
EID A
EID B
EID A
1
2
2
3
1
3
3
4
2
2
2
-
3
1
1
-
AF→A
SID
EID A
EID F
EID A
1
3
3
3
1
4
4
-
2
2
2
-
3
1
1
-
BF→A
SID
EID B
EID F
EID A
1
3
3
4
2
2
2
-
3
1
1
-
4
3
3
4
D→B→A
SID
EID D
EID B
EID A
1
1
2
3
4
1
3
4
D→F→A
SID
EID D
EID F
EID A
1
1
3
4
4
1
3
4
D→BF
SID
EID D
EID B
EID F
1
1
3
3
2
-
2
2
3
-
1
1
4
1
3
3
F→AF
SID
EID F
EID A
EID F
1
-
3
3
1
-
4
4
2
-
2
2
3
-
1
1
D→AF
SID
EID D
EID A
EID F
1
1
3
3
1
1
4
4
2
-
2
2
3
-
1
1
D→AB
SID
EID D
EID A
EID B
1
1
3
3
1
1
4
4
2
-
2
2
3
-
1
1
F→AB
SID
EID F
EIDA
EID B
1
-
2
2
1
-
3
3
2
-
2
2
3
-
1
1
B→AB
SID
EID B
EID A
EID B
1
-
2
2
1
2
3
3
2
-
2
2
3
-
1
1
B→AF
SID
EID B
EID A
EID F
1
-
2
2
1
2
3
3
2
-
2
2
3
-
1
1
ABF→A
SID
EID A
EID B
EID F
EID A
1
3
3
3
4
2
2
2
2
-
3
1
1
1
-
D→BF→A
SID
EID D
EID B
EID F
EID A
1
1
3
3
4
2
-
2
2
-
3
-
1
1
-
4
1
3
3
4