Case study

Open Class Room
Antoine Pigeau
Oct 16, 2017
Oct 26, 2018
Research - Higher education
Computing Environments for Human Learning

Analysis scenario: Measuring learner progress over time


Mandatory fields

Name of study

OpenClassrooms

Description of study

Anticipating drop-out among MOOC learners at early stages of their interaction with the course.

How has the ethical dimension been taken into account (discussion, ethics committee, ...)?

Learner IDs in the dataset are encrypted to be anonymous

ex: user-id "8ba9e854028c44e1c3975f3f725f9112"
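The 32-character hexadecimal ID above is consistent with a one-way hash digest such as MD5. A minimal sketch of this kind of pseudonymisation follows; the salt and the choice of hash function are illustrative assumptions, not OpenClassrooms' documented procedure:

```python
import hashlib

def pseudonymise(user_id: str, salt: str = "secret-salt") -> str:
    """Replace a raw user ID with a salted one-way hash.

    The salt value and the use of MD5 are assumptions for illustration;
    the actual encryption scheme used for the dataset is not specified here.
    """
    return hashlib.md5((salt + user_id).encode("utf-8")).hexdigest()

# The same input always maps to the same 32-character pseudonym,
# so joins across the exported tables remain possible after anonymisation.
anon_id = pseudonymise("12345")
```

Because the mapping is deterministic, the pseudonymised IDs can still link a learner's events across tables without revealing the original identifier.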

Name and contact of the person who can give information about the data

"Alya Itani" <alya.itani@imt-atlantique.fr>; 

"Serge Garlatti" <serge.garlatti@imt-atlantique.fr>; 

"Laurent Brisson" <laurent.brisson@imt-atlantique.fr>; 

For the case study > to be merged with "Name of producer(s)"

Alya Itani

Additional Fields

Files

Mandatory fields

Description of the research problem

The main objectives of this work are the following:

* Spot learners at risk of dropping out at early stages of their interaction with the course (e.g. after the end of the second chapter).

* Identify possible reasons for this drop-out (poor MOOC design, over- or under-qualification, digital illiteracy, etc.).

* Carry out the necessary interventions (automatic or personalized) to prevent and decrease drop-out rates among learners.

Creation date of the research problem

May 2016

Description of research questions

What are the best ways of achieving an accurate and early prediction of learners' dropout?

How can the underlying reasons for learners' drop-out be identified?

What are the most suitable means of intervention to prevent possible drop-out from happening?

 

Methodological considerations

What ethical problems can be encountered with this research problem (e.g. access to individual data, ...)?

Access to individual data.

References on related research problems

Diyi Yang, Tanmay Sinha, David Adamson, and Carolyn Penstein Rosé. 2013. Turn on, tune in, drop out: Anticipating student dropouts in massive open online courses. In Proceedings of the 2013 NIPS Data-driven Education Workshop, Vol. 11. 14.

Keith Devlin. 2013. MOOCs and the Myths of Dropout Rates and Certification. Huff Post College. Retrieved March 2 (2013), 2013.

Martin Hlosta, Zdenek Zdrahal, and Jaroslav Zendulka. 2017. Ouroboros: early identification of at-risk students without models based on legacy data. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference. ACM, 6–15.

Sandeep M Jayaprakash, Erik W Moody, Eitel JM Lauría, James R Regan, and Joshua D Baron. 2014. Early alert of academically at-risk students: An open source analytics initiative. Journal of Learning Analytics 1, 1 (2014), 6–47.

Hanan Khalil and Martin Ebner. 2013. How satisfied are you with your MOOC?-A Research Study on Interaction in Huge Online Courses. In Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications. 830–839.

Hanan Khalil and Martin Ebner. 2014. MOOCs completion rates and possible methods to improve retention-A literature review. In World Conference on Educational Multimedia, Hypermedia and Telecommunications. 1305–1313.

Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. 2014. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. 60–65.

Elke Lackner, Martin Ebner, and Mohammad Khalil. 2015. MOOCs as granular systems: design patterns to foster participant activity. Retrieved September 10 (2015), 2015.

Daniel FO Onah, Jane Sinclair, and Russell Boyatt. 2014. Dropout rates of massive open online courses: behavioural patterns. EDULEARN14 Proceedings (2014), 5825–5834.

Dan Colman. 2013. MOOC interrupted: Top 10 reasons our readers didn't finish a massive open online course. Open Culture (2013).

Simon Cross. 2013. Evaluation of the OLDS MOOC curriculum design course: participant perspectives, expectations and experiences. (2013).

Ezekiel J Emanuel. 2013. Online education: MOOCs taken by educated few. Nature 503, 7476 (2013), 342–342.

Shalin Hai-Jew. 2014. Iff and Other Conditionals: Expert Perceptions of the Feasibility. Remote Workforce Training: Effective Technologies and Strategies: Effective Technologies and Strategies (2014), 278.

Yassine Tabaa and Abdellatif Medouri. 2013. LASyM: A learning analytics system for MOOCs. International Journal of Advanced Computer Science and Applications (IJACSA) 4, 5 (2013).

Haibo He and Edwardo A Garcia. 2009. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21, 9 (2009), 1263–1284.

Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daumé III, and Lise Getoor. 2014. Learning latent engagement patterns of students in online courses. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 1272–1278.

Ry Rivard. 2013. Measuring the MOOC dropout rate. Inside Higher Ed 8 (2013), 2013.

Carolyn P Rosé and George Siemens. 2014. Shared task on prediction of dropout over time in massively open online courses. In Proc. of EMNLP, Vol. 14. 39.

Miaomiao Wen, Diyi Yang, and Carolyn Rosé. 2014. Sentiment Analysis in MOOC Discussion Forums: What does it tell us? In Educational Data Mining 2014.

Miaomiao Wen, Diyi Yang, and Carolyn Penstein Rosé. 2014. Linguistic Reflections of Student Engagement in Massive Open Online Courses. In ICWSM.

Jacob Whitehill, Kiran Mohan, Daniel Seaton, Yigal Rosen, and Dustin Tingley. 2017. Delving Deeper into MOOC Student Dropout Prediction. arXiv preprint arXiv:1702.06404 (2017).

Li Yuan, Stephen Powell, JISC CETIS, et al. 2013. MOOCs and open education: Implications for higher education. (2013).

Additional fields

Description of data

Data location

OpenClassrooms Database

Description of the storage format of data (files, database, ...)

10 JSON files, one for each course, include the metadata of the course, its parts, and sub-parts.

9 CSV files include the learners' interaction data extracted from OpenClassrooms' database (9 database tables).

For more detailed information about the data structure, refer to the attached PDF file.
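A minimal sketch of loading these two kinds of files; the directory layout and file names are assumptions, and pandas stands in for whatever CSV reader the study actually used:

```python
import glob
import json
from pathlib import Path

import pandas as pd  # assumed available, consistent with the Python/Jupyter setup

# Course metadata: one JSON file per course (course, parts, sub-parts).
# The "courses/" directory name is a hypothetical layout for illustration.
courses = {Path(p).stem: json.loads(Path(p).read_text(encoding="utf-8"))
           for p in glob.glob("courses/*.json")}

# Interaction traces: one CSV file per exported database table
# (USER, USER_PREMIUM, USER_COURSE_VISUALISATION, ...).
frames = {Path(p).stem: pd.read_csv(p)
          for p in glob.glob("tables/*.csv")}
```

Keeping one DataFrame per exported table mirrors the original database structure, so the joins described in the schema below remain straightforward.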

 

Description of the data model used to describe analyzable data (e.g. DB, xAPI, CSV, …)

Structure of the 9 CSV files (each representing a table in the OC database)

USER (user)
-- string            sdz_user.id
-- boolean        sdz_user.active
-- datetime       sdz_user.birthday
-- string            sdz_user.city
-- string            sdz_user.country
(free-text field; the data is not always consistent, e.g. France, fr, franc …)
-- string            sdz_user.gender
-- string            sdz_user.locale (learner's language; defaults to the browser language)
-- string            sdz_user.region
-- string            sdz_user.zip_code

USER_PREMIUM (premium subscription dates; there can be several entries per user)
-- int                 sdz_subscription.user_id
-- datetime       sdz_subscription.started_at
-- datetime       sdz_subscription.ended_at

USER_COURSE_VISUALISATION (page-view trace)
-- integer         oc_course_visualisation.course_id
-- integer|null   oc_course_visualisation.part_id
(if null, only the course home page was viewed; otherwise a chapter was viewed)
-- integer        oc_course_visualisation.session_id
-- datetime      oc_course_visualisation.date
-- integer|null  oc_course_visualisation.user_id (null = anonymous visitor)

USER_FOLLOW_COURSE (event recorded when a learner follows a course)
-- int                 user_follow_course.course_id,
-- datetime       user_follow_course.created_at,
-- int                 user_follow_course.user_id

USER_UNFOLLOW_COURSE (event recorded when a learner unfollows a course)
-- int                 user_unfollow_course.course_id,
-- datetime       user_unfollow_course.created_at,
-- int                 user_unfollow_course.user_id

COMPLETED_PART (mark-as-complete event)
-- integer         claire_completed_part.part_id,
-- boolean        claire_completed_part.completed,
-- datetime       claire_completed_part.created_at,
-- integer         claire_completed_part.user_id

EXERCISE
-- integer         claire_exercise.id,
-- integer         claire_exercise.reference_course_id,
-- integer         claire_exercise.reference_id, (the chapter)
-- string            claire_exercise.type,
-- boolean        claire_exercise.active,
-- integer         claire_exercise.position

USER_EXERCISE_SESSION (exercise session)
-- integer         claire_exercise_session.id
-- integer         claire_exercise_session.exercise_id
-- datetime       claire_exercise_session.created_at
-- datetime       claire_exercise_session.completed_at
-- integer         claire_exercise_session.score

USER_COURSE_RESULT (course result)
-- integer         claire_user_course_result.reference_id, (the course)
-- integer         claire_user_course_result.user_score,
-- boolean        claire_user_course_result.passed,
-- integer         claire_user_course_result.passing_score,
-- integer         claire_user_course_result.max_score,
-- datetime       claire_user_course_result.created_at,
-- integer         claire_user_course_result.user_id

Data description (e.g. contents, size, number of records, ...)

The dataset includes 10 courses:
- Animez une communauté Twitter
- Apprenez à coder avec JavaScript
- Apprenez à créer votre site web avec HTML5 et CSS3
- Comprendre le Web
- Continuez avec Ruby on Rails
- Découvrez les bases de la gestion de projet
- Développez une application mobile multi-plateforme avec Ionic
- Prenez en main Bootstrap
- Programmez vos premiers montages avec Arduino
- Utilisez des API REST dans vos projets web

1 year of collected data (October 2015 to October 2016)

Number of total learners in all 10 courses: 203541 

Legal proceedings regarding the use of data

A signed agreement that states the conditions of the use of data.

OpenClassrooms informs their students of the possible use of their data for research purposes.

For the case study - Data properties (name, labs, universities, companies, ...)

OpenClassrooms, HUBBLE

Description of data collect

Additional fields

General Information

Purposes of analysis

The analysis is complete and can be tested for real-time validation by OpenClassrooms.

Person(s) in charge of the analysis (pre-processing and processing)

Alya Itani, Laurent Brisson

Which actors would be interested in the analysis and why?

The analysis applied can help:

1- MOOC providers in improving their MOOCs through the early prediction and prevention of drop-out among their learners, thereby decreasing customer churn and the associated loss.

2- Teachers in uncovering and solving drop-out reasons related to course design and learners' behavior.

3- Learners in receiving an enhanced learning experience.

State of progress of the analysis scenario (e.g. pre-processing, processing, dashboards, ...)

Preprocessing is complete.

Feature selection is complete (extra features can be added in the future).

The analysis or predictive modeling phase is complete.

Evaluation on held-out data is complete.

Real-time evaluation has not yet been carried out.

Date or period of the analysis

March to July 2017

Description of learning analytics goals

Research objective:

We focus on anticipating attrition among MOOC learners and on finding the reasons behind this attrition. The main reasons investigated are those related to course design and learners' behavior, in line with the requirements of the MOOC provider OpenClassrooms. The main objective is to help OpenClassrooms improve their MOOCs through the detection, and corresponding prevention, of drop-out among their learner population. Two critical business needs are identified in this context: first, the accurate detection of at-risk droppers, which allows automated motivational feedback to be sent to prevent learner drop-out; second, the investigation of possible drop-out reasons, which allows reporting to an intermediary, such as platform designers or teachers, who can then make the necessary personalized interventions.

To meet these needs, we present a generic machine-learning-based drop-out prediction model that takes advantage of both predictive and explicative types of machine learning classifiers. This exploration makes three main contributions: (1) proposing an enhanced, reliable drop-out prediction model that is generic in its modeling and evaluation and can be used to effectively spot at-risk droppers at a specified instant in the course; (2) offering a preliminary insight into the readability of explicative classifiers for determining possible reasons for drop-out; (3) introducing and testing the effect of advanced features related to the trajectories of learners' engagement with the course (backward jumps, frequent jumps, inactivity time evolution). The findings of the experimental testing support these three claims.

How has the ethical dimension of the analysis been taken into account?

Anonymisation of user-ids

Pre-processing of data

Global description of pre-processing

Collecting and Cleaning the Dataset

The data is first collected as CSV and JSON files through authorized access to a secured link offered by the MOOC provider. The data is then carefully examined to undergo the necessary structure verification and cleaning. Verifying the structure and validity of the traces requires knowledge exchange with the MOOC provider. An initial examination of the data is then needed to point out any corrupt or doubtful entries and fields within the traces. The cleaning performed on the explored dataset included the following:

• Neglecting missing users: eliminating all events of user_ids that do not exist in the main user table. Such entries correspond to deactivated or deleted user accounts.
• Considering the most recent results: keeping the most recent entry as the only entry for each user. The occurrence of multiple entries is explained as a possible system error during exercise sessions.
• Including missing timestamps: some timestamp fields were missing from the data although required for the accuracy of the analysis, e.g. the created_at field of the result and exercise-session events.
• Removing duplicate entries: redundant duplicate entries in the course visualization table were cleaned out; in addition, some id values of anonymous visitors were missing in the visualization events.
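Assuming the tables are loaded as pandas DataFrames with the column names from the schema above, the cleaning steps might look like the following; the exact procedure is a reconstruction, not the study's actual script:

```python
import pandas as pd

def clean(users: pd.DataFrame, results: pd.DataFrame,
          visualisations: pd.DataFrame) -> tuple:
    """Illustrative version of the cleaning steps described above."""
    # 1. Neglect missing users: drop events whose user_id is absent
    #    from the main USER table (deactivated or deleted accounts).
    known = set(users["id"])
    results = results[results["user_id"].isin(known)]
    visualisations = visualisations[
        visualisations["user_id"].isin(known)
        | visualisations["user_id"].isna()  # keep anonymous visitors
    ]
    # 2. Keep only the most recent result per (user, course).
    results = (results.sort_values("created_at")
                      .drop_duplicates(["user_id", "reference_id"],
                                       keep="last"))
    # 3. Remove exact duplicate visualisation events.
    visualisations = visualisations.drop_duplicates()
    return results, visualisations
```

The (user, course) deduplication key follows the USER_COURSE_RESULT schema, where `reference_id` identifies the course.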

Platforms or software used to pre-process the data

Python on Jupyter Notebook.

Treatments of data

Overall description of the treatments used (e.g. make a list of the methods used)

Machine learning classifiers used:
• Gradient Boosting
• Random Forest
• Decision Trees
• Logistic regression
• K Nearest Neighbors
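A sketch of how these five classifiers could be compared under stratified cross-validation; scikit-learn is an assumption (consistent with the Python/Jupyter setup above), and synthetic data stands in for the real 34-feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the 34 engineered features;
# the real feature matrix is derived from the OpenClassrooms traces.
X, y = make_classification(n_samples=300, n_features=34, random_state=0)

classifiers = {
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
}

# Stratified folds preserve the dropout/completion class ratio in each
# split, which matters for imbalanced MOOC data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(clf, X, y, cv=cv).mean()
          for name, clf in classifiers.items()}
```

Stratification is used because dropout datasets are typically class-imbalanced, so plain k-fold splits can produce folds with very few completers.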

Scripts of processing (download)

All code is shared on Hubble drive.

Additional fields

General information

Description of analysis results

In this exploration, two main business needs are pursued: first, achieving an accurate prediction at a specified instant of the course, established with the help of predictive classifiers; second, uncovering the underlying reasons for the predicted drop-out, established with the help of explicative classifiers.
The experimental findings can be summarized as follows. Fitting the models with proper stratified hyper-parameter tuning and cross-validation positively affected the predictive performance of the classifiers. Upon testing, all STV-fitted classifiers, with RF dominating, succeeded in delivering accurate predictions of at-risk learners at the end of the second chapter of the course, thus satisfying the need of the MOOC provider to send automated motivational feedback to spotted at-risk learners. Moreover, we were able to attain the expected readability of predictions using the DT classifier, making it possible for MOOC providers and teachers to use such information to enhance their courses and personalize their interventions with at-risk learners. Finally, including behavioral indicator features in the model slightly enhanced its overall predictive power and helped in better understanding the reasons for drop-out.
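The readability attributed to the decision tree (DT) classifier can be illustrated with scikit-learn's rule export; the data and feature names below are toy stand-ins, not the study's actual features:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data in place of the real learner features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text turns the fitted tree into human-readable if/else rules,
# which is the kind of explanation a teacher or platform designer can act on.
rules = export_text(tree, feature_names=[f"feature_{i}" for i in range(4)])
```

Each path from root to leaf reads as a plain-language condition (e.g. "if feature_2 is low and feature_0 is high, predict drop-out"), which is what makes the DT classifier explicative rather than only predictive.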

Type of results produced (model, indicator, algorithms, ...)

An accurate and early drop-out predictive model.

How are the results acceptable from an ethical point of view? Or what are the perceived ethical problems?

Details

Indicator

Indicator's names

Refer to the Indicators Table

Indicators description

Refer to the Indicators Table

Number of dimensions

Each indicator, whether typical or behavioral, comprises a set of features. The indicators in this case study form a combination of 34 mixed-type features (numerical and categorical) alongside one categorical binary target variable, with 0 denoting completion (the negative class) and 1 denoting drop-out (the positive class).
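Mixed numerical and categorical features of this kind are typically one-hot encoded before being fed to the classifiers; a minimal sketch with hypothetical column names (the real study uses 34 features derived from the traces):

```python
import pandas as pd

# Hypothetical rows standing in for the engineered indicators.
df = pd.DataFrame({
    "n_sessions": [12, 3, 7],      # numerical indicator
    "gender": ["f", "m", "f"],     # categorical indicator
    "dropout": [0, 1, 0],          # target: 1 = dropped (positive class)
})

y = df["dropout"]
# One-hot encode the categorical columns so every feature is numeric.
X = pd.get_dummies(df.drop(columns="dropout"))
```

After encoding, tree-based and distance-based classifiers alike can consume the matrix, at the cost of one extra column per category level.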

Value type (discrete, continue)

Refer to the Indicators Table

Mode of processing (real time, delayed)

One year of recorded data.

Dashboards
