Case study

OpenClassrooms
Antoine Pigeau
16 Oct. 2017
26 Oct. 2018
Research - Higher education
Environnement Informatique pour l'Apprentissage Humain (EIAH)

Analysis scenario: Measuring learner progress over time


Required fields

Name of the study

OpenClassrooms

Description of the study

Anticipating drop-out among MOOC learners at early stages of their interaction with the course.

How was the ethical dimension of the study taken into account?

Learner IDs in the dataset are encrypted so that learners remain anonymous.

e.g., user-id "8ba9e854028c44e1c3975f3f725f9112"

Name and contact details of the people who can provide information about the data

"Alya Itani" <alya.itani@imt-atlantique.fr>; 

"Serge Garlatti" <serge.garlatti@imt-atlantique.fr>; 

"Laurent Brisson" <laurent.brisson@imt-atlantique.fr>; 

Name of the case study partner

Alya Itani

Additional fields

Associated files

Required fields

Description of the research problem

The main objectives of this work are the following:

* Spot learners at risk of dropping out at early stages of their interaction with the course (e.g., after the end of the second chapter).

* Identify possible reasons for this drop-out (poor MOOC design, over- or under-qualification, digital illiteracy, etc.).

* Carry out the necessary interventions (automated or personalized) to prevent drop-out and decrease drop-out rates among learners.

Date the research problem was defined

May 2016

Description of the research questions

What are the best ways to achieve an accurate and early prediction of learner drop-out?

How can the underlying reasons for learner drop-out be identified?

What are the most effective means of intervention to prevent drop-out from happening?

 

Methodological considerations

What ethical issues may arise with this research problem? (e.g., access to individual data, ...)

Access to individual data.

References on related research problems

Diyi Yang, Tanmay Sinha, David Adamson, and Carolyn Penstein Rosé. 2013. Turn on, tune in, drop out: Anticipating student dropouts in massive open online courses. In Proceedings of the 2013 NIPS Data-driven Education Workshop, Vol. 11. 14.

Keith Devlin. 2013. MOOCs and the Myths of Dropout Rates and Certification. Huff Post College. Retrieved March 2 (2013), 2013.

Martin Hlosta, Zdenek Zdrahal, and Jaroslav Zendulka. 2017. Ouroboros: early identification of at-risk students without models based on legacy data. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference. ACM, 6–15.

Sandeep M Jayaprakash, Erik W Moody, Eitel JM Lauría, James R Regan, and Joshua D Baron. 2014. Early alert of academically at-risk students: An open source analytics initiative. Journal of Learning Analytics 1, 1 (2014), 6–47.

Hanan Khalil and Martin Ebner. 2013. How satisfied are you with your MOOC?-A Research Study on Interaction in Huge Online Courses. In Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications. 830–839.

Hanan Khalil and Martin Ebner. 2014. MOOCs completion rates and possible methods to improve retention-A literature review. In World Conference on Educational Multimedia, Hypermedia and Telecommunications. 1305–1313.

Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. 2014. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. 60–65.

Elke Lackner, Martin Ebner, and Mohammad Khalil. 2015. MOOCs as granular systems: design patterns to foster participant activity. Retrieved September 10 (2015), 2015.

Daniel FO Onah, Jane Sinclair, and Russell Boyatt. 2014. Dropout rates of massive open online courses: behavioural patterns. EDULEARN14 Proceedings (2014), 5825–5834.

Dan Colman. 2013. MOOC interrupted: Top 10 reasons our readers didn't finish a massive open online course. Open Culture (2013).

Simon Cross. 2013. Evaluation of the OLDS MOOC curriculum design course: participant perspectives, expectations and experiences. (2013).

Ezekiel J Emanuel. 2013. Online education: MOOCs taken by educated few. Nature 503, 7476 (2013), 342–342.

Shalin Hai-Jew. 2014. Iff and Other Conditionals: Expert Perceptions of the Feasibility. Remote Workforce Training: Effective Technologies and Strategies: Effective Technologies and Strategies (2014), 278.

Yassine Tabaa and Abdellatif Medouri. 2013. LASyM: A learning analytics system for MOOCs. International Journal of Advanced Computer Science and Applications (IJACSA) 4, 5 (2013).

Haibo He and Edwardo A Garcia. 2009. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21, 9 (2009), 1263–1284.

Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daumé III, and Lise Getoor. 2014. Learning latent engagement patterns of students in online courses. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 1272–1278.

Ry Rivard. 2013. Measuring the MOOC dropout rate. Inside Higher Ed 8 (2013), 2013.

Carolyn P Rosé and George Siemens. 2014. Shared task on prediction of dropout over time in massively open online courses. In Proc. of EMNLP, Vol. 14. 39.

Miaomiao Wen, Diyi Yang, and Carolyn Rosé. 2014. Sentiment Analysis in MOOC Discussion Forums: What does it tell us? In Educational Data Mining 2014.

Miaomiao Wen, Diyi Yang, and Carolyn Penstein Rosé. 2014. Linguistic Reflections of Student Engagement in Massive Open Online Courses. In ICWSM.

Jacob Whitehill, Kiran Mohan, Daniel Seaton, Yigal Rosen, and Dustin Tingley. 2017. Delving Deeper into MOOC Student Dropout Prediction. arXiv preprint arXiv:1702.06404 (2017).

Li Yuan, Stephen Powell, JISC CETIS, et al. 2013. MOOCs and open education: Implications for higher education. (2013).

Additional fields

Description of the data

Location of the data

OpenClassrooms Database

Description of the physical storage structure of the data (e.g., directory structure, database, CSV files, ...)

10 JSON files, one per course, containing the metadata of the course, its parts, and sub-parts.

9 CSV files containing the learner-interaction data extracted from OpenClassrooms' database (9 database tables).

For more detailed information about the data structure, refer to the attached PDF file.

 

Description of the data model used for the analyzable data (DB, xAPI, CSV, …)

Structure of the 9 CSV files (each representing a table in the OC database):

USER (user)
-- string            sdz_user.id
-- boolean        sdz_user.active
-- datetime       sdz_user.birthday
-- string            sdz_user.city
-- string            sdz_user.country
(free-text field; the data is not always consistent, e.g., France, fr, franc …)
-- string            sdz_user.gender
-- string            sdz_user.locale (learner's language; defaults to the language of the browser used)
-- string            sdz_user.region
-- string            sdz_user.zip_code

USER_PREMIUM (premium subscription dates; there may be several entries per user)
-- int                 sdz_subscription.user_id
-- datetime       sdz_subscription.started_at
-- datetime       sdz_subscription.ended_at

USER_COURSE_VISUALISATION (page-view trace)
-- integer         oc_course_visualisation.course_id
-- integer|null   oc_course_visualisation.part_id
(if null, view of the course home page only; otherwise view of the chapter)
-- integer        oc_course_visualisation.session_id
-- datetime      oc_course_visualisation.date
-- integer|null  oc_course_visualisation.user_id (null = anonymous visitor)

USER_FOLLOW_COURSE (event recorded when a learner starts following a course)
-- int                 user_follow_course.course_id,
-- datetime       user_follow_course.created_at,
-- int                 user_follow_course.user_id

USER_UNFOLLOW_COURSE (event recorded when a learner stops following a course)
-- int                 user_unfollow_course.course_id,
-- datetime       user_unfollow_course.created_at,
-- int                 user_unfollow_course.user_id

COMPLETED_PART (mark-as-complete event)
-- integer         claire_completed_part.part_id,
-- boolean        claire_completed_part.completed,
-- datetime       claire_completed_part.created_at,
-- integer         claire_completed_part.user_id

EXERCISE
-- integer         claire_exercise.id,
-- integer         claire_exercise.reference_course_id,
-- integer         claire_exercise.reference_id, (chapter)
-- string            claire_exercise.type,
-- boolean        claire_exercise.active,
-- integer         claire_exercise.position

USER_EXERCISE_SESSION (exercise session)
-- integer         claire_exercise_session.id
-- integer         claire_exercise_session.exercise_id
-- datetime       claire_exercise_session.created_at
-- datetime       claire_exercise_session.completed_at
-- integer         claire_exercise_session.score

USER_COURSE_RESULT (course result)
-- integer         claire_user_course_result.reference_id, (the course)
-- integer         claire_user_course_result.user_score,
-- boolean        claire_user_course_result.passed,
-- integer         claire_user_course_result.passing_score,
-- integer         claire_user_course_result.max_score,
-- datetime       claire_user_course_result.created_at,
-- integer         claire_user_course_result.user_id
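As an illustration of the nullable fields above, here is a minimal sketch in Python/pandas (the tooling named later in this form) that loads a tiny extract shaped like the USER_COURSE_VISUALISATION table. Only the column names come from the schema; the rows and values are invented for illustration.

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row extract shaped like the USER_COURSE_VISUALISATION
# table; only the column names come from the schema above.
csv_data = StringIO(
    "course_id,part_id,session_id,date,user_id\n"
    "12,7,1001,2015-10-03 10:12:00,42\n"
    "12,,1002,2015-10-03 10:15:00,\n"
)

views = pd.read_csv(csv_data, parse_dates=["date"])

# A null part_id means only the course home page was viewed;
# a null user_id marks an anonymous visitor.
home_page_views = int(views["part_id"].isna().sum())
anonymous_views = int(views["user_id"].isna().sum())
print(home_page_views, anonymous_views)  # → 1 1
```

The same null-handling applies when joining these events back to the USER table.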

Description of the data (content, size, number of records, ...)

The dataset includes 10 courses:
* Animez une communauté Twitter
* Apprenez à coder avec JavaScript
* Apprenez à créer votre site web avec HTML5 et CSS3
* Comprendre le Web
* Continuez avec Ruby on Rails
* Découvrez les bases de la gestion de projet
* Développez une application mobile multi-plateforme avec Ionic
* Prenez en main Bootstrap
* Programmez vos premiers montages avec Arduino
* Utilisez des API REST dans vos projets web

1 year of collected data (October 2015 to October 2016).

Total number of learners across all 10 courses: 203,541.

Legal procedures relating to the use of the data

A signed agreement stating the conditions of use of the data.

OpenClassrooms informs its students of the possible use of their data for research purposes.

For the case study - Data ownership (name of the person, laboratory, or company that owns the data)

OpenClassrooms, HUBBLE

Description of the data collection

Additional fields

General information

Purpose of the analysis

The analysis is complete and can be tested by OpenClassrooms for real-time validation.

Person(s) responsible for the analysis (pre-processing and processing)

Alya Itani, Laurent Brisson

Which actors are likely to be interested in the analysis, and why?

The analysis can help:

1. MOOC providers, by improving their MOOCs through the early prediction and prevention of drop-out among their learners, thereby decreasing customer churn and the losses that follow from it.

2. Teachers, by uncovering and addressing drop-out reasons related to course design and learner behavior.

3. Learners, by providing an enhanced learning experience.

Progress status of the analysis scenario (e.g., pre-processing, processing, dashboards, ...)

Pre-processing is complete.

Feature selection is complete (extra features can be added in the future).

The analysis (predictive modeling) phase is complete.

Evaluation on held-out data is complete.

Real-time evaluation has not yet been carried out.

Date or period of the analysis

March to July 2017

Objectives of the analysis for learning analytics

Research Objective:

We focus on anticipating attrition among MOOC learners and finding the reasons behind it. The main reasons investigated are those related to course design and learner behavior, according to the requirements of the MOOC provider OpenClassrooms. The main objective is to help OpenClassrooms improve their MOOCs through the detection and corresponding prevention of drop-out among their learner population. Two critical business needs are identified in this context: first, the accurate detection of at-risk droppers, which allows sending automated motivational feedback to prevent learner drop-out; second, the investigation of possible drop-out reasons, which allows reporting to an intermediary, such as platform designers or teachers, who accordingly make the necessary personalized interventions.

To meet these needs, we present a generic machine-learning-based drop-out prediction model that takes advantage of both predictive and explicative types of machine-learning classifiers. This work makes three main contributions: (1) proposing an enhanced, reliable drop-out prediction model that is generic in its modeling and evaluation and can be used to effectively spot at-risk droppers at a specified instant of the course; (2) offering a preliminary insight into the readability of explicative classifiers for determining possible reasons for drop-out; (3) introducing and testing the effect of advanced features related to the trajectories of learners' engagement with the course (backward jumps, frequent jumps, inactivity time evolution). The experimental findings support the validity of these three claims.
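The trajectory features mentioned above (backward jumps, inactivity time evolution) can be sketched from an ordered view trace. This is a minimal illustration, not the study's actual implementation: the trace values and the exact feature definitions are invented.

```python
import pandas as pd

# Hypothetical view trace of one learner, ordered in time; chapter
# positions would come from the course metadata in the JSON files.
trace = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-01", "2016-03-02", "2016-03-05",
                            "2016-03-06", "2016-03-20"]),
    "chapter_position": [1, 2, 1, 3, 4],
})

# A backward jump is a visit to an earlier chapter than the previous one.
backward_jumps = int((trace["chapter_position"].diff() < 0).sum())

# Inactivity gaps (in days) between consecutive visits; a growing gap
# can signal disengagement.
gaps = trace["date"].diff().dt.days.dropna()
max_inactivity = int(gaps.max())
print(backward_jumps, max_inactivity)  # → 1 14
```

Per-learner features like these are then fed to the classifiers alongside the typical indicators.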

How was the ethical dimension of the analysis taken into account?

Anonymization of user IDs.

Data pre-processing

Overall description of the pre-processing

Collecting and Cleaning the Dataset

The data is first collected in the form of CSV and JSON files through authorized access to a secure link provided by the MOOC provider. The data is then carefully examined in order to carry out the necessary structure verification and cleaning. Verifying the structure and validity of the traces requires knowledge exchange with the MOOC provider. An initial examination of the data is then needed to point out any corrupt or doubtful entries and fields within the traces. The cleaning performed on the explored dataset included the following:

• Removing missing users: eliminating all events whose user_id does not exist in the main user table. Such entries stem from deactivated or deleted user accounts.
• Keeping the most recent results: considering the most recent entry as the only entry for each user. The occurrence of multiple entries is attributed to a possible system error during exercise sessions.
• Including missing timestamps: timestamp fields required for the accuracy of the analysis were missing from the data, e.g., the created_at field of result and exercise-session events.
• Removing duplicate entries: redundant duplicate entries in the course-visualization table were cleaned out; in addition, some id values of anonymous visitors were missing from the visualization events.
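The cleaning steps above can be sketched in pandas as follows. The toy frames and values are invented; only the column names follow the schema described earlier.

```python
import pandas as pd

# Toy frames standing in for the real CSV exports; the column names
# follow the schema described earlier, the values are invented.
users = pd.DataFrame({"id": [1, 2]})
results = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],  # user 3 no longer exists (deleted account)
    "user_score": [5, 8, 6, 4],
    "created_at": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-01-15", "2016-01-20"]),
})

# 1. Drop events whose user_id is absent from the main user table.
results = results[results["user_id"].isin(users["id"])]

# 2. Keep only the most recent entry per user.
results = (results.sort_values("created_at")
                  .drop_duplicates("user_id", keep="last"))

# 3. Remove exact duplicate rows.
results = results.drop_duplicates()

kept = sorted(results["user_score"].tolist())
print(kept)  # → [6, 8]
```

The same filter-then-deduplicate pattern applies to the other event tables.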

Platforms or software used to pre-process the data

Python on Jupyter Notebook.

Data processing

Overall description of the processing applied (e.g., a list of the methods used)

Machine learning classifiers used:
• Gradient Boosting
• Random Forest
• Decision Trees
• Logistic regression
• K Nearest Neighbors
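A minimal scikit-learn sketch of how the five listed classifiers might be compared under the stratified cross-validation mentioned in the results; the synthetic dataset, hyper-parameters, and scoring choice are assumptions for illustration, not the study's actual configuration.

```python
# Synthetic stand-in for the real 34-feature dataset; the classifier list
# matches the methods above, everything else is an assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.3, 0.7], random_state=0)

classifiers = {
    "Gradient Boosting":   GradientBoostingClassifier(random_state=0),
    "Random Forest":       RandomForestClassifier(random_state=0),
    "Decision Trees":      DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
}

# Stratified folds preserve the completion/drop-out class ratio per split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
          for name, clf in classifiers.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Stratification matters here because drop-out data is typically imbalanced (see the He & Garcia reference above).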

Processing scripts (download)

All code is shared on Hubble drive.

Additional fields

General information

Description of the results of the analysis

In this exploration, two main business needs are pursued: first, achieving an accurate prediction at a specified instant of the course, established with the help of predictive classifiers; second, uncovering the underlying reasons for the predicted drop-out, established with the help of explicative classifiers.

The experimental findings can be summed up as follows. Fitting the models with proper stratified hyper-parameter tuning and cross-validation positively affected the predictive performance of the classifiers. Upon testing, all STV-fitted classifiers, with Random Forest dominating, succeeded in delivering accurate predictions of at-risk learners at the end of the second chapter of the course, thus satisfying the MOOC provider's need to send automated motivational feedback to learners spotted as at risk. Moreover, we were able to attain the expected readability of predictions using the Decision Tree classifier, making it possible for MOOC providers and teachers to use such information to enhance their courses and personalize their interventions with at-risk learners. Finally, including behavioral-indicator features in the model slightly enhanced its overall predictive power and favored a better understanding of the reasons for drop-out.
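To illustrate the "explicative" side of the approach, a shallow decision tree can be dumped as human-readable rules. The data and feature names below are hypothetical stand-ins for the study's indicators, not its actual features.

```python
# Tiny synthetic dataset; the feature names are hypothetical stand-ins
# for the study's indicators.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
features = ["n_backward_jumps", "mean_inactivity_days",
            "exercise_score", "n_completed_parts"]

# A shallow tree keeps the rules short enough to hand to a teacher.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=features)
print(rules)
```

The printed if/else rules are what an intermediary (platform designer or teacher) could read to hypothesize drop-out reasons.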

Type of results produced (model, indicator, algorithm, …)

An accurate and early drop-out predictive model.

In what way are the results acceptable from an ethical point of view? Or what are the perceived ethical issues?

Details

Indicator

Name of the indicator

Refer to the Indicators Table

Description of the indicator

Refer to the Indicators Table

Number of dimensions

Each indicator, whether typical or behavioral, consists of a set of features. The indicators in this case study form a combination of 34 mixed-type features (numerical and categorical) alongside one categorical binary target variable, with 0 denoting completion (the negative class) and 1 denoting drop-out (the positive class).

Type of values (continuous, discrete)

Refer to the Indicators Table

Calculation mode (real time, post-session)

One year of recorded data.

Dashboards

Description of the ethical aspects

How was the ethical dimension of the study taken into account?

Learner IDs in the dataset are encrypted so that learners remain anonymous.

e.g., user-id "8ba9e854028c44e1c3975f3f725f9112"

What ethical issues may arise with this research problem? (e.g., access to individual data, ...)

Access to individual data.

Legal procedures relating to the use of the data

A signed agreement stating the conditions of use of the data.

OpenClassrooms informs its students of the possible use of their data for research purposes.

How was the ethical dimension of the analysis taken into account?

Anonymization of user IDs.

Progress status of the analysis scenario (e.g., pre-processing, processing, dashboards, ...)

Pre-processing is complete.

Feature selection is complete (extra features can be added in the future).

The analysis (predictive modeling) phase is complete.

Evaluation on held-out data is complete.

Real-time evaluation has not yet been carried out.

In what way are the results acceptable from an ethical point of view? Or what are the perceived ethical issues?