
Data problem using RapidMiner (Python / SciKitLearn)

Problem:

I have a synthetic log file that has all the important features of the real-life logs I typically
encounter. It looks like this:

Table 1: Log

The complete log you should work with is in the attachment syntheticLog.csv.

The log is essentially made up of a random combination of these six sequence blocks (this is for
explanation only; the only thing available for analytics is the log above):

Table 2: Information to be discovered

Note that in types A and B the motions 2 and 3 occur at the same time. Therefore their
order in the log is random (it could be 2 3 or 3 2).
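
One way to keep this random ordering from hiding the pattern is to bring parallel motions into a fixed order before any sequence analysis. The following is only a minimal sketch. It assumes that parallel motions carry an identical timestamp in the log, and the (timestamp, motion) tuple layout and the function name canonical_order are hypothetical, since the actual columns of Table 1 are not reproduced here.

```python
from itertools import groupby


def canonical_order(events):
    """Sort motions that share a timestamp so that 2 3 and 3 2 look identical.

    events: list of (timestamp, motion) tuples in log order (assumed layout).
    Assumption: parallel motions have exactly the same timestamp.
    """
    ordered = []
    # groupby collects consecutive events with an equal timestamp
    for _, group in groupby(events, key=lambda event: event[0]):
        # within one timestamp, order by motion number
        ordered.extend(sorted(group, key=lambda event: event[1]))
    return ordered


if __name__ == "__main__":
    log = [(0, 1), (1, 3), (1, 2), (2, 4)]
    print(canonical_order(log))
```

If the real log marks parallel motions differently (for example by overlapping time frames rather than equal timestamps), the grouping key would have to be adapted.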

Goal:

As the outcome I want the log to be completed as follows:

Table 3: End result


Expert Knowledge:

- There are different types (here A and B) and transitions (here TAB and TBA) that might occur if
two different types follow each other.
- If two different types follow each other and a transition (here TAB) exists, then the transition
belongs to the second type, unless the time frame of the transition falls completely within the time
frame of the first type (here TBA), in which case the transition can simply be ignored.
- Each unique style / option combination equals one type
- One could think I could simply start at the bottom, read style, option and seqNum and copy
those into all empty fields above until I encounter a new set of style, option and seqNum.
However, it's not that easy, because motion 8, for example, belongs to style B in table 2, whereas
motion 9 (as a transition) could belong to style A. How could I distinguish this without knowing
the pattern of types A and B? How could motion 9 in the above example be ignored?
- The base pattern of type A or B can only be seen if A follows A or B follows B since there are no
transitions in between.
- In reality the type blocks are much longer and their length could differ considerably from
type to type (e.g. type A could have 20 log events and type B only 5).
- The two parallel motions 2 and 3 are also only an example; there could be more than two parallel
motions.
- The log does not necessarily begin or end with a complete type block, as the log
simply represents a snapshot.
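
The transition rule from the list above can be written as a small interval check. This is a minimal sketch, not a finished implementation: it assumes each time frame is available as a (start, end) pair with comparable values, and the function name assign_transition and the return labels "ignore" and "second" are made up for illustration.

```python
def assign_transition(transition, first_type):
    """Decide what to do with a transition between two different types.

    transition, first_type: (start, end) time frames (assumed representation).
    Returns "ignore" if the transition lies completely inside the time frame
    of the first type, otherwise "second" (it belongs to the second type).
    """
    t_start, t_end = transition
    f_start, f_end = first_type
    if f_start <= t_start and t_end <= f_end:
        return "ignore"
    return "second"


if __name__ == "__main__":
    print(assign_transition((5, 7), (0, 10)))   # completely inside the first type
    print(assign_transition((9, 12), (0, 10)))  # extends past the first type
```

The same check could be turned into an additional boolean feature column (transition inside preceding block: yes/no) if the rule is to be fed into a learner.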

My non machine learning solution:

I solved the problem with sequence mining using GSM as well as SPADE with the following steps:
- Find all the reoccurring sequences
- Based on style and option combination pick the shortest sequence for every type
- Loop through the log, find the occurrences of those shortest sequences and fill in the blank
style, option and seqNum in those blocks with the information already available within the same
block.
- Loop through the log and check if any of the now remaining transitions (with blank style, option,
seqNum) fall into the time frame of the preceding type. If so, mark them so their style, option,
seqNum will remain blank
- Loop through the log again from the bottom, read style, option and seqNum and copy those in
all empty fields above until I encounter a new set of style, option and seqNum, unless the fields
have been marked to remain blank
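
The last step above (filling from the bottom while respecting the marked rows) can be sketched in plain Python. This is an illustration under assumptions: rows are represented as dictionaries, a blank field is None, a row is either fully labelled or fully blank, and the function name backfill_labels and the keep_blank index set are hypothetical helpers, not part of my existing solution.

```python
FIELDS = ("style", "option", "seqNum")


def backfill_labels(rows, keep_blank=frozenset()):
    """Copy style, option and seqNum upwards into blank rows.

    rows: list of dicts in log order; blank fields are None (assumed layout).
    keep_blank: indices of rows marked to stay blank (ignored transitions).
    Returns a new list; the input rows are not modified.
    """
    filled = [dict(row) for row in rows]
    current = None  # labels of the nearest labelled row below
    for i in range(len(filled) - 1, -1, -1):  # walk from the bottom up
        row = filled[i]
        if all(row.get(field) is not None for field in FIELDS):
            current = {field: row[field] for field in FIELDS}
        elif i not in keep_blank and current is not None:
            row.update(current)
    return filled


if __name__ == "__main__":
    blank = {"style": None, "option": None, "seqNum": None}
    log = [
        {"motion": 1, **blank},
        {"motion": 2, "style": "A", "option": "x", "seqNum": 1},
        {"motion": 9, **blank},  # transition marked to remain blank
        {"motion": 4, "style": "B", "option": "y", "seqNum": 2},
    ]
    for row in backfill_labels(log, keep_blank={2}):
        print(row)
```

Rows at the very end of the log with no labelled row below them stay blank in this sketch, which matches the fact that the log is only a snapshot.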

Given the rapid advances in computer science, GSM and SPADE can be considered antiquated and
possibly slow, so I want to find a machine learning approach that will yield the same result.

My expectations:

- Find a working machine learning solution for the problem described above. Only use neural
networks or deep learning if you can convince me that no other machine learning solution is
applicable.
- Explain to me, preferably in writing, how the expert knowledge needs to be incorporated into the
table in the form of additional features and how that can be done within RapidMiner.
- Explain to me, preferably in writing, the steps necessary in RapidMiner to get the result shown in
table 3.
- If a solution in RapidMiner is not possible, show me a well-documented solution using Jupyter
Notebook, Python and SciKitLearn.
- Tell me, preferably in writing, how the above solution could be transferred to a standalone
application.