Transforming your raw data into an event log object is one of the most challenging tasks in process analysis. On this page, we cover all the possible situations and challenges that you can encounter.
Logs: eventlog vs activitylog
`bupaR` supports two different kinds of log formats, both of which are extensions of the R `data.frame`:

- `eventlog`: Event logs are created from a `data.frame` in which each row represents a single event. This means that each row has a single timestamp.
- `activitylog`: Activity logs are created from a `data.frame` in which each row represents a single activity instance. This means that each row can have multiple timestamps, stored in different columns.
The data model below shows the difference between these two levels of observation, i.e. activity instances vs events.
The example below shows an excerpt of an event log containing 6 events. Each event is linked to a single timestamp. As there can be more than one event within a single activity instance, each event also needs to be linked to a lifecycle status (here `registration_type`). Furthermore, an activity instance identifier (`handling_id`) is needed to indicate which events belong to the same activity instance.
handling | patient | employee | handling_id | registration_type | time |
---|---|---|---|---|---|
Registration | 333 | r1 | 333 | start | 2017-11-15 16:50:59 |
Registration | 333 | r1 | 333 | complete | 2017-11-15 18:45:18 |
Triage and Assessment | 333 | r2 | 833 | start | 2017-11-16 20:37:26 |
Triage and Assessment | 333 | r2 | 833 | complete | 2017-11-17 08:21:08 |
Blood test | 333 | r3 | 1152 | start | 2017-11-17 22:27:09 |
Blood test | 333 | r3 | 1152 | complete | 2017-11-18 03:16:03 |
Transactional lifecycle?
An event is an atomic registration related to an activity instance. It thus contains one (and only one) timestamp. Additionally, the event should include a reference to a lifecycle transition. More specifically, multiple events can describe different lifecycle transitions of a single activity instance. For example, one event might record when a surgery is scheduled, another when it is started, yet another when it is completed, etc.
The standard transactional lifecycle.
The table below shows the same data as above, but now using the `activitylog` format. There are now just 3 rows instead of 6, but each row has 2 timestamps, representing 2 events. The lifecycle status represented by each timestamp is now given by the column name of that variable.
handling | patient | employee | handling_id | complete | start |
---|---|---|---|---|---|
Registration | 333 | r1 | 333 | 2017-11-15 18:45:18 | 2017-11-15 16:50:59 |
Triage and Assessment | 333 | r2 | 833 | 2017-11-17 08:21:08 | 2017-11-16 20:37:26 |
Blood test | 333 | r3 | 1152 | 2017-11-18 03:16:03 | 2017-11-17 22:27:09 |
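The relation between the two tables can be illustrated with a plain reshape. The sketch below uses only dplyr and tidyr on a hand-typed copy of part of the excerpt; it is meant to show the idea of moving from one row per event to one row per activity instance, not how bupaR converts logs internally (for real log objects, `to_activitylog()` is the function to use).

```r
library(dplyr)
library(tidyr)

# Event-level data: one row per event (values copied from the excerpt above)
events <- tibble(
  handling = c("Registration", "Registration", "Blood test", "Blood test"),
  patient = "333",
  employee = c("r1", "r1", "r3", "r3"),
  handling_id = c("333", "333", "1152", "1152"),
  registration_type = c("start", "complete", "start", "complete"),
  time = as.POSIXct(c("2017-11-15 16:50:59", "2017-11-15 18:45:18",
                      "2017-11-17 22:27:09", "2017-11-18 03:16:03"),
                    tz = "UTC")
)

# One row per activity instance: each lifecycle status becomes a timestamp column
activities <- events %>%
  pivot_wider(names_from = registration_type, values_from = time)

activities
```

Note that this reshape only yields one row per activity instance because the remaining columns (including `employee`) are identical for all events of an instance.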
As these examples show, both formats can often be used for representing the same process data. However, there are some important differences between them:
- The `eventlog` format has much more flexibility in terms of lifecycle. There is no limit to the number of events that can occur in a single activity instance. If your data contains lifecycle statuses such as suspend, resume or reassign, they can be recorded multiple times within a single activity instance. In the `activitylog` format, as each lifecycle status gets its own column, it isn't possible to have two events of the same lifecycle status in a single activity instance.
- The level of observation in an `eventlog` is an event. As a result, attribute values can be stored at the event level. In an `activitylog`, the level of observation is an activity instance. This means that all additional attributes that you have about your process should be at this higher level. For example, an activity instance can only be connected to a single resource in the `activitylog` format, whereas in an `eventlog` different events within the same activity instance can have different resources, or different values for any other attribute.
- Because of the limited flexibility, an `activitylog` is easier to make, and typically closer to the format that your data is already in (see further below on how to construct log objects). As a result, there are many situations in which the analysis of an `activitylog` will be much faster compared to an `eventlog`, where a lot of additional complexity needs to be taken into account.
The right log for the job
Functionalities in bupaR core packages support both formats. As such, the goal of your analysis does not impact the decision; only the complexity of your data matters. The precise format your raw data is in will further define the preparatory steps that are needed. We can distinguish between 3 typical scenarios. The flowchart below helps you on your way.
An `activitylog` is the best option when each row in your data is an activity instance, or when events belonging to the same activity instance have equal attribute values (e.g. all events are executed by the same resource). When neither of these two criteria holds, you can create an `eventlog` object.
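Whether the second criterion holds can be checked before constructing any log. The following is a minimal dplyr sketch on made-up event data; the column names mirror the examples on this page, and the check covers only the resource attribute (other attributes would need the same treatment).

```r
library(dplyr)

# Hypothetical event-level data; column names follow the examples on this page
events <- tibble(
  handling_id = c("116", "116", "616", "616"),
  employee = c("r2", "r6", "r1", "r1"),
  registration_type = c("start", "complete", "start", "complete")
)

# Count the distinct resources within each activity instance
resources_per_instance <- events %>%
  group_by(handling_id) %>%
  summarise(n_resources = n_distinct(employee), .groups = "drop")

# TRUE only when every activity instance has exactly one resource,
# in which case an activitylog can represent the resource attribute
activitylog_suffices <- all(resources_per_instance$n_resources == 1)
```

Here instance `116` is handled by two different employees, so `activitylog_suffices` is `FALSE` and the `eventlog` format would be needed to keep both resources.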
Scenario 1
If each row in your `data.frame` is already an activity instance, the `activitylog` format is the best way to go. Consider the data sample below.
patient | handling | activity_started | activity_ended |
---|---|---|---|
464 | Blood test | 2018-04-06 20:04:09 | 2018-04-07 01:18:17 |
464 | Check-out | 2018-04-12 19:02:11 | 2018-04-12 21:41:01 |
464 | Discuss Results | 2018-04-12 11:00:16 | 2018-04-12 13:59:44 |
464 | MRI SCAN | 2018-04-07 06:30:56 | 2018-04-07 09:37:26 |
464 | Registration | 2018-03-20 19:07:17 | 2018-03-20 21:15:41 |
464 | Triage and Assessment | 2018-03-21 15:58:55 | 2018-03-22 05:21:56 |
As each row contains multiple timestamps, i.e. `activity_started` and `activity_ended`, it is clear that each row represents an activity instance. Turning this dataset into an `activitylog` requires the following steps:
- Timestamp variables should be named in correspondence with the standard transactional lifecycle.
- Timestamp variables should be of type `Date` or `POSIXct`.
- Use the `activitylog` constructor function.
    # bupaR provides the log constructors, lubridate provides ymd_hms
    library(bupaR)
    library(lubridate)

    data %>%
        # rename timestamp variables appropriately
        dplyr::rename(start = activity_started,
                      complete = activity_ended) %>%
        # convert timestamps to POSIXct
        convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
        activitylog(case_id = "patient",
                    activity_id = "handling",
                    timestamps = c("start", "complete"))
    ## # Log of 12 events consisting of:
    ## 1 trace
    ## 1 case
    ## 6 instances of 6 activities
    ## 0 resources
    ## Events occurred from 2018-03-20 19:07:17 until 2018-04-12 21:41:01
    ##
    ## # Variables were mapped as follows:
    ## Case identifier:     patient
    ## Activity identifier: handling
    ## Resource identifier: employee
    ## Timestamps:          start, complete
    ##
    ## # A tibble: 6 × 5
    ##   patient handling              start               complete            .order
    ##   <chr>   <fct>                 <dttm>              <dttm>               <int>
    ## 1 464     Blood test            2018-04-06 20:04:09 2018-04-07 01:18:17      1
    ## 2 464     Check-out             2018-04-12 19:02:11 2018-04-12 21:41:01      2
    ## 3 464     Discuss Results       2018-04-12 11:00:16 2018-04-12 13:59:44      3
    ## 4 464     MRI SCAN              2018-04-07 06:30:56 2018-04-07 09:37:26      4
    ## 5 464     Registration          2018-03-20 19:07:17 2018-03-20 21:15:41      5
    ## 6 464     Triage and Assessment 2018-03-21 15:58:55 2018-03-22 05:21:56      6
Note that if a resource identifier is available, it can be added in the `activitylog` call through the `resource_id` argument.
Scenario 2
If each row in your `data.frame` is an event, but all events that belong to the same activity instance share the same attribute values, the `activitylog` format is again the best way to go. Consider the data sample below.
patient | handling | employee | handling_id | registration_type | time |
---|---|---|---|---|---|
227 | Registration | r1 | 227 | started | 2017-08-09 19:55:30 |
227 | Triage and Assessment | r2 | 727 | started | 2017-08-09 22:17:43 |
227 | Registration | r1 | 227 | completed | 2017-08-09 22:17:43 |
227 | Triage and Assessment | r2 | 727 | completed | 2017-08-10 15:21:30 |
227 | Blood test | r3 | 1109 | started | 2017-08-17 03:01:24 |
227 | Blood test | r3 | 1109 | completed | 2017-08-17 09:17:20 |
227 | MRI SCAN | r4 | 1346 | started | 2017-08-17 13:15:04 |
227 | MRI SCAN | r4 | 1346 | completed | 2017-08-17 18:47:44 |
227 | Discuss Results | r6 | 1961 | started | 2017-08-22 13:33:38 |
227 | Check-out | r7 | 2456 | started | 2017-08-22 15:38:38 |
227 | Discuss Results | r6 | 1961 | completed | 2017-08-22 15:38:38 |
227 | Check-out | r7 | 2456 | completed | 2017-08-22 17:12:46 |
The resource identifier (`employee`) has been added as an additional attribute. Note that though each row is an event, the rows can be grouped into activity instances using the `handling_id` column, which we will call the activity instance id. Using the latter, we can see that the resource attribute is the same within each activity instance, which allows us to create an `activitylog`. The steps to do so are the following:
- The values of the lifecycle variable should correspond to the standard transactional lifecycle (here `started` and `completed` are recoded to `start` and `complete`).
- The timestamp variable should be of type `Date` or `POSIXct`.
- Use the `eventlog` constructor function.
- Convert to an `activitylog` using `to_activitylog` for reduced memory usage and improved performance.
    data %>%
        # recode lifecycle variable appropriately
        dplyr::mutate(registration_type = forcats::fct_recode(registration_type,
                                                              "start" = "started",
                                                              "complete" = "completed")) %>%
        convert_timestamps(columns = "time", format = ymd_hms) %>%
        eventlog(case_id = "patient",
                 activity_id = "handling",
                 activity_instance_id = "handling_id",
                 lifecycle_id = "registration_type",
                 timestamp = "time",
                 resource_id = "employee") %>%
        to_activitylog() -> tmp_act
Note that the resource identifier is optional, and can be left out of the `eventlog` call if such an attribute does not exist in your data. If the activity instance id does not exist, some heuristics are available to generate it: see Missing activity instance id.
Scenario 3
If each row is an event, and events of the same activity instance have differing attribute values, the flexibility of `eventlog` objects is required. Consider the data sample below.
patient | handling | employee | handling_id | registration_type | time |
---|---|---|---|---|---|
116 | Registration | r2 | 116 | started | 2017-04-29 03:24:59 |
116 | Registration | r6 | 116 | completed | 2017-04-29 06:23:09 |
116 | Triage and Assessment | r1 | 616 | started | 2017-04-29 15:41:27 |
116 | Triage and Assessment | r7 | 616 | completed | 2017-04-30 03:04:21 |
116 | Blood test | r4 | 1054 | started | 2017-04-30 15:13:28 |
116 | Blood test | r6 | 1054 | completed | 2017-04-30 21:24:18 |
116 | MRI SCAN | r1 | 1291 | started | 2017-05-01 01:12:51 |
116 | MRI SCAN | r4 | 1291 | completed | 2017-05-01 05:32:37 |
116 | Discuss Results | r3 | 1850 | started | 2017-05-01 09:44:20 |
116 | Discuss Results | r7 | 1850 | completed | 2017-05-01 14:00:48 |
116 | Check-out | r3 | 2345 | started | 2017-05-03 04:02:35 |
116 | Check-out | r2 | 2345 | completed | 2017-05-03 06:16:03 |
In this example, different resources (employees) sometimes perform the start and complete event of the same activity instance. Therefore, we resort to the `eventlog` format, which has no problem storing this. The steps to take are the following:
- The values of the lifecycle variable should correspond to the standard transactional lifecycle (here `started` and `completed` are recoded to `start` and `complete`).
- The timestamp variable should be of type `Date` or `POSIXct`.
- Use the `eventlog` constructor function.
    data %>%
        # recode lifecycle variable appropriately
        dplyr::mutate(registration_type = forcats::fct_recode(registration_type,
                                                              "start" = "started",
                                                              "complete" = "completed")) %>%
        convert_timestamps(columns = "time", format = ymd_hms) %>%
        eventlog(case_id = "patient",
                 activity_id = "handling",
                 activity_instance_id = "handling_id",
                 lifecycle_id = "registration_type",
                 timestamp = "time",
                 resource_id = "employee")
    ## Warning in validate_eventlog(eventlog): The following activity instances are
    ## connected to more than one resource: 1054,116,1291,1850,2345,616

    ## # Log of 12 events consisting of:
    ## 1 trace
    ## 1 case
    ## 6 instances of 6 activities
    ## 6 resources
    ## Events occurred from 2017-04-29 03:24:59 until 2017-05-03 06:16:03
    ##
    ## # Variables were mapped as follows:
    ## Case identifier:              patient
    ## Activity identifier:          handling
    ## Resource identifier:          employee
    ## Activity instance identifier: handling_id
    ## Timestamp:                    time
    ## Lifecycle transition:         registration_type
    ##
    ## # A tibble: 12 × 7
    ##    patient handling   employee handling_id registration_type time
    ##    <chr>   <fct>      <fct>    <chr>       <fct>             <dttm>
    ##  1 116     Registrat… r2       116         start             2017-04-29 03:24:59
    ##  2 116     Registrat… r6       116         complete          2017-04-29 06:23:09
    ##  3 116     Triage an… r1       616         start             2017-04-29 15:41:27
    ##  4 116     Triage an… r7       616         complete          2017-04-30 03:04:21
    ##  5 116     Blood test r4       1054        start             2017-04-30 15:13:28
    ##  6 116     Blood test r6       1054        complete          2017-04-30 21:24:18
    ##  7 116     MRI SCAN   r1       1291        start             2017-05-01 01:12:51
    ##  8 116     MRI SCAN   r4       1291        complete          2017-05-01 05:32:37
    ##  9 116     Discuss R… r3       1850        start             2017-05-01 09:44:20
    ## 10 116     Discuss R… r7       1850        complete          2017-05-01 14:00:48
    ## 11 116     Check-out  r3       2345        start             2017-05-03 04:02:35
    ## 12 116     Check-out  r2       2345        complete          2017-05-03 06:16:03
    ## # ℹ 1 more variable: .order <int>
Note that we need an `eventlog` irrespective of which attribute values differ, i.e. it can be resources, but also any additional variables you have in your data set. For the special case of resource values, a different resource executing events in the same activity instance might be a data quality issue. If so, some functions can help you to identify this issue: see Inconsistent Resources.
Again, if the activity instance id does not exist, some heuristics are available to generate it: see Missing activity instance id.
Typical problems
Missing activity instance id
In order to correlate events that belong to the same activity instance, an activity instance identifier is required. For example, in the data shown below, a patient has gone through two different surgeries, each with its own start and complete event. The activity instance identifier then allows you to distinguish which events belong together and which do not. It is important to note that this instance identifier should be unique, also across different cases and activities.
patient | activity | timestamp | status | activity_instance |
---|---|---|---|---|
John Doe | check-in | 2017-05-10 08:33:26 | complete | 1 |
John Doe | surgery | 2017-05-10 08:53:16 | start | 2 |
John Doe | surgery | 2017-05-10 09:25:19 | complete | 2 |
John Doe | treatment | 2017-05-10 10:01:25 | start | 3 |
John Doe | treatment | 2017-05-10 10:35:18 | complete | 3 |
John Doe | surgery | 2017-05-10 10:41:35 | start | 4 |
John Doe | surgery | 2017-05-10 11:05:56 | complete | 4 |
John Doe | check-out | 2017-05-11 14:52:36 | complete | 5 |
If the activity instance identifier is not available, you can use the `assign_instance_id()` function, which uses a heuristic to create the missing identifier. Alternatively, you can try to create the identifier on your own using `dplyr::mutate()` and other manipulation functions.
Large Datasets and Validation
By default, `bupaR` validates certain properties of the activity instances that are supplied when creating an event log:
- a single activity instance identifier must not be connected to multiple cases,
- a single activity instance identifier must not be connected to multiple activity labels.

However, these checks are not efficient and may lead to considerable performance issues for large data frames. If you already know that your data fulfills all the requirements, you can deactivate the validation using the argument `validate = FALSE` when creating the `eventlog`. Note that when the activity instance id was created with the `assign_instance_id()` function, you can assume the above properties hold.
Inconsistent Resources
Each event can contain the notion of a resource. Different events belonging to the same activity instance may be executed by different resources, as in the `eventlog` below.
patient | handling | employee | handling_id | registration_type | time | .order |
---|---|---|---|---|---|---|
206 | Registration | r4 | 206 | start | 2017-07-19 15:48:14 | 1 |
206 | Triage and Assessment | r6 | 706 | start | 2017-07-19 17:03:44 | 2 |
206 | Registration | r3 | 206 | complete | 2017-07-19 17:03:44 | 3 |
206 | Triage and Assessment | r7 | 706 | complete | 2017-07-20 07:28:53 | 4 |
206 | Blood test | r1 | 1100 | start | 2017-07-25 03:02:14 | 5 |
206 | Blood test | r3 | 1100 | complete | 2017-07-25 08:14:46 | 6 |
206 | MRI SCAN | r6 | 1337 | start | 2017-07-25 12:37:36 | 7 |
206 | MRI SCAN | r2 | 1337 | complete | 2017-07-25 16:52:16 | 8 |
206 | Discuss Results | r2 | 1940 | start | 2017-07-26 07:36:36 | 9 |
206 | Discuss Results | r4 | 1940 | complete | 2017-07-26 11:08:03 | 10 |
206 | Check-out | r1 | 2435 | start | 2017-07-28 02:54:17 | 11 |
206 | Check-out | r7 | 2435 | complete | 2017-07-28 03:55:13 | 12 |
If you have a large dataset and want an overview of the activity instances that have more than one resource connected to them, you can use the `detect_resource_inconsistencies()` function.
    log %>%
        detect_resource_inconsistencies()

    ## # A tibble: 6 × 5
    ##   patient handling              handling_id complete start
    ##   <chr>   <fct>                 <chr>       <chr>    <chr>
    ## 1 206     Blood test            1100        r3       r1
    ## 2 206     Check-out             2435        r7       r1
    ## 3 206     Discuss Results       1940        r4       r2
    ## 4 206     MRI SCAN              1337        r2       r6
    ## 5 206     Registration          206         r3       r4
    ## 6 206     Triage and Assessment 706         r7       r6
If you want to remove these inconsistencies, a quick fix is to merge the resource labels together with `fix_resource_inconsistencies()`. Note that this is not needed for an `eventlog`, but it is for an `activitylog`. While the creation of the `eventlog` will emit a warning when resource inconsistencies exist, this should mostly be seen as a data quality warning. That said, some analyses, such as those that count resources, might give odd results when such inconsistencies are present.
    log %>%
        fix_resource_inconsistencies()

    ## *** OUTPUT ***
    ## A total of 6 activity executions in the event log are classified as inconsistencies.
    ## They are spread over the following cases and activities:
    ## # A tibble: 6 × 5
    ##   patient handling              handling_id complete start
    ##   <chr>   <fct>                 <chr>       <chr>    <chr>
    ## 1 206     Blood test            1100        r3       r1
    ## 2 206     Check-out             2435        r7       r1
    ## 3 206     Discuss Results       1940        r4       r2
    ## 4 206     MRI SCAN              1337        r2       r6
    ## 5 206     Registration          206         r3       r4
    ## 6 206     Triage and Assessment 706         r7       r6
    ## Inconsistencies solved succesfully.

    ## # Log of 12 events consisting of:
    ## 1 trace
    ## 1 case
    ## 6 instances of 6 activities
    ## 6 resources
    ## Events occurred from 2017-07-19 15:48:14 until 2017-07-28 03:55:13
    ##
    ## # Variables were mapped as follows:
    ## Case identifier:              patient
    ## Activity identifier:          handling
    ## Resource identifier:          employee
    ## Activity instance identifier: handling_id
    ## Timestamp:                    time
    ## Lifecycle transition:         registration_type
    ##
    ## # A tibble: 12 × 7
    ##    patient handling   employee handling_id registration_type time
    ##    <chr>   <fct>      <chr>    <chr>       <fct>             <dttm>
    ##  1 206     Registrat… r3 - r4  206         start             2017-07-19 15:48:14
    ##  2 206     Triage an… r7 - r6  706         start             2017-07-19 17:03:44
    ##  3 206     Registrat… r3 - r4  206         complete          2017-07-19 17:03:44
    ##  4 206     Triage an… r7 - r6  706         complete          2017-07-20 07:28:53
    ##  5 206     Blood test r3 - r1  1100        start             2017-07-25 03:02:14
    ##  6 206     Blood test r3 - r1  1100        complete          2017-07-25 08:14:46
    ##  7 206     MRI SCAN   r2 - r6  1337        start             2017-07-25 12:37:36
    ##  8 206     MRI SCAN   r2 - r6  1337        complete          2017-07-25 16:52:16
    ##  9 206     Discuss R… r4 - r2  1940        start             2017-07-26 07:36:36
    ## 10 206     Discuss R… r4 - r2  1940        complete          2017-07-26 11:08:03
    ## 11 206     Check-out  r7 - r1  2435        start             2017-07-28 02:54:17
    ## 12 206     Check-out  r7 - r1  2435        complete          2017-07-28 03:55:13
    ## # ℹ 1 more variable: .order <int>