Case study
Analysis scenario: Measuring learners' progress over time
Mandatory fields
Name of study
OpenClassrooms
Description of study
Anticipating drop-out among MOOC learners at early stages of their interaction with the course.
How has the ethical dimension been taken into account? (Discussion, ethics committee, ...)
Learner IDs in the dataset are encrypted so that learners remain anonymous,
e.g. user-id "8ba9e854028c44e1c3975f3f725f9112"
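The 32-character hexadecimal IDs shown above have the shape of a one-way hash. A minimal sketch of this kind of pseudonymisation (the hash function, the salt, and the function name are illustrative assumptions, not the documented OpenClassrooms procedure):

```python
import hashlib

def anonymise_user_id(raw_id: str, salt: str = "study-salt") -> str:
    """Map a raw user id to a stable 32-hex-character pseudonym.

    NOTE: MD5 and the salt value are assumptions for illustration;
    the source only shows the shape of the resulting identifier.
    """
    return hashlib.md5((salt + raw_id).encode("utf-8")).hexdigest()

# Deterministic: the same raw id always yields the same pseudonym,
# so events can still be joined per learner after anonymisation.
print(anonymise_user_id("12345"))
```

Because the mapping is deterministic, all of a learner's events remain linkable across tables without exposing the original account identifier.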
Name and contact of the person who can give information about the data
"Alya Itani" <alya.itani@imt-atlantique.fr>;
"Serge Garlatti" <serge.garlatti@imt-atlantique.fr>;
"Laurent BRISSON" <laurent.brisson@imt-atlantique.fr>;
For case study > to be merged with "Name of producer(s)"
Alya Itani
Additional Fields
Files
Mandatory fields
Problematic description
The main objectives of this work are the following:
* Spot learners who are at risk of dropping out at early stages of their interaction with the course (for example, after the end of the second chapter).
* Identify possible reasons for this drop-out (poor MOOC design, over- or under-qualification, digital illiteracy, etc.).
* Carry out the necessary interventions (automatic or personalized) to prevent and decrease drop-out rates among learners.
Creation date of problematic
May 2016
Description of research questions
What are the best ways of achieving an accurate and early prediction of learners' dropout?
How can the underlying reasons for learners' drop-out be identified?
What are the most suitable means of intervention to prevent drop-out from happening?
Methodological considerations
What ethical problems can be encountered with this problematic? (e.g. access to individual data, ...)
Access to individual data.
References about related problematics
Diyi Yang, Tanmay Sinha, David Adamson, and Carolyn Penstein Rosé. 2013. Turn on, tune in, drop out: Anticipating student dropouts in massive open online courses. In Proceedings of the 2013 NIPS Data-Driven Education Workshop, Vol. 11. 14.
Keith Devlin. 2013. MOOCs and the Myths of Dropout Rates and Certification. Huff Post College. Retrieved March 2, 2013.
Martin Hlosta, Zdenek Zdrahal, and Jaroslav Zendulka. 2017. Ouroboros: early identification of at-risk students without models based on legacy data. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference. ACM, 6–15.
Sandeep M Jayaprakash, Erik W Moody, Eitel JM Lauría, James R Regan, and Joshua D Baron. 2014. Early alert of academically at-risk students: An open source analytics initiative. Journal of Learning Analytics 1, 1 (2014), 6–47.
Hanan Khalil and Martin Ebner. 2013. How satisfied are you with your MOOC? A Research Study on Interaction in Huge Online Courses. In Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications. 830–839.
Hanan Khalil and Martin Ebner. 2014. MOOCs completion rates and possible methods to improve retention: A literature review. In World Conference on Educational Multimedia, Hypermedia and Telecommunications. 1305–1313.
Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. 2014. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs. 60–65.
Elke Lackner, Martin Ebner, and Mohammad Khalil. 2015. MOOCs as granular systems: design patterns to foster participant activity. Retrieved September 10, 2015.
Daniel FO Onah, Jane Sinclair, and Russell Boyatt. 2014. Dropout rates of massive open online courses: behavioural patterns. EDULEARN14 Proceedings (2014), 5825–5834.
Dan Colman. 2013. MOOC interrupted: Top 10 reasons our readers didn't finish a massive open online course. Open Culture (2013).
Simon Cross. 2013. Evaluation of the OLDS MOOC curriculum design course: participant perspectives, expectations and experiences. (2013).
Ezekiel J Emanuel. 2013. Online education: MOOCs taken by educated few. Nature 503, 7476 (2013), 342.
Shalin Hai-Jew. 2014. Iff and Other Conditionals: Expert Perceptions of the Feasibility. Remote Workforce Training: Effective Technologies and Strategies (2014), 278.
Yassine Tabaa and Abdellatif Medouri. 2013. LASyM: A learning analytics system for MOOCs. International Journal of Advanced Computer Science and Applications (IJACSA) 4, 5 (2013).
Haibo He and Edwardo A Garcia. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1263–1284.
Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daumé III, and Lise Getoor. 2014. Learning latent engagement patterns of students in online courses. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 1272–1278.
Ry Rivard. 2013. Measuring the MOOC dropout rate. Inside Higher Ed 8 (2013).
Carolyn P Rosé and George Siemens. 2014. Shared task on prediction of dropout over time in massively open online courses. In Proc. of EMNLP, Vol. 14. 39.
Miaomiao Wen, Diyi Yang, and Carolyn Rosé. 2014. Sentiment Analysis in MOOC Discussion Forums: What does it tell us? In Educational Data Mining 2014.
Miaomiao Wen, Diyi Yang, and Carolyn Penstein Rosé. 2014. Linguistic Reflections of Student Engagement in Massive Open Online Courses. In ICWSM.
Jacob Whitehill, Kiran Mohan, Daniel Seaton, Yigal Rosen, and Dustin Tingley. 2017. Delving Deeper into MOOC Student Dropout Prediction. arXiv preprint arXiv:1702.06404 (2017).
Li Yuan, Stephen Powell, JISC CETIS, et al. 2013. MOOCs and open education: Implications for higher education. (2013).
Additional fields
Description of data
Data location
OpenClassrooms Database
Description of the storage format of data (files, database, ...)
10 JSON files, one per course, containing the metadata of the course, its parts, and sub-parts.
9 CSV files containing the data on learners' interactions extracted from OpenClassrooms' database (9 database tables).
For more detailed information about the data structure, refer to the attached PDF file.
Description of data model used to describe analyzable data (e.g. DB, xAPI, CSV, ...)
Structure of the 9 CSV files (each representing a table in the OC database)
USER (user)
-- string sdz_user.id
-- boolean sdz_user.active
-- datetime sdz_user.birthday
-- string sdz_user.city
-- string sdz_user.country
(free-text field; the data is not always consistent, e.g. France, fr, franc …)
-- string sdz_user.gender
-- string sdz_user.locale (learner's language; defaults to the language of the browser used)
-- string sdz_user.region
-- string sdz_user.zip_code
USER_PREMIUM (premium subscription dates; there can be several entries per user)
-- int sdz_subscription.user_id
-- datetime sdz_subscription.started_at
-- datetime sdz_subscription.ended_at
USER_COURSE_VISUALISATION (visualisation trace)
-- integer oc_course_visualisation.course_id
-- integer|null oc_course_visualisation.part_id
(if null, visualisation of the course home page only; otherwise, visualisation of the chapter)
-- integer oc_course_visualisation.session_id
-- datetime oc_course_visualisation.date
-- integer|null oc_course_visualisation.user_id (null = anonymous visitor)
USER_FOLLOW_COURSE (event recorded when a learner starts following a course)
-- int user_follow_course.course_id,
-- datetime user_follow_course.created_at,
-- int user_follow_course.user_id
USER_UNFOLLOW_COURSE (event recorded when a learner stops following a course)
-- int user_unfollow_course.course_id,
-- datetime user_unfollow_course.created_at,
-- int user_unfollow_course.user_id
COMPLETED_PART (mark-as-complete event)
-- integer claire_completed_part.part_id,
-- boolean claire_completed_part.completed,
-- datetime claire_completed_part.created_at,
-- integer claire_completed_part.user_id
EXERCISE
-- integer claire_exercise.id,
-- integer claire_exercise.reference_course_id,
-- integer claire_exercise.reference_id, (chapter)
-- string claire_exercise.type,
-- boolean claire_exercise.active,
-- integer claire_exercise.position
USER_EXERCISE_SESSION (exercise session)
-- integer claire_exercise_session.id
-- integer claire_exercise_session.exercise_id
-- datetime claire_exercise_session.created_at
-- datetime claire_exercise_session.completed_at
-- integer claire_exercise_session.score
USER_COURSE_RESULT (course result)
-- integer claire_user_course_result.reference_id, (the course)
-- integer claire_user_course_result.user_score,
-- boolean claire_user_course_result.passed,
-- integer claire_user_course_result.passing_score,
-- integer claire_user_course_result.max_score,
-- datetime claire_user_course_result.created_at,
-- integer claire_user_course_result.user_id
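The tables above can be combined per learner with pandas, the document's stated processing environment. A hypothetical sketch using two of the tables (the tiny in-memory frames stand in for the real CSV files, which would be loaded with `pd.read_csv`; the second user id is a made-up placeholder):

```python
import pandas as pd

# In-memory stand-ins for two of the CSV tables described above.
users = pd.DataFrame({
    "sdz_user.id": ["8ba9e854028c44e1c3975f3f725f9112",
                    "ffffffffffffffffffffffffffffffff"],  # hypothetical id
    "sdz_user.country": ["France", "fr"],  # free-text field, inconsistent values
})
follows = pd.DataFrame({
    "user_follow_course.user_id": ["8ba9e854028c44e1c3975f3f725f9112"],
    "user_follow_course.course_id": [42],
    "user_follow_course.created_at": pd.to_datetime(["2015-10-01"]),
})

# Left-join follow events onto the user table; learners with no follow
# event keep a NaN course_id, which matters when building drop-out labels.
merged = users.merge(
    follows,
    left_on="sdz_user.id",
    right_on="user_follow_course.user_id",
    how="left",
)
print(merged[["sdz_user.id", "user_follow_course.course_id"]])
```

The same pattern extends to the visualisation, exercise-session, and result tables, each keyed on the user id.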
Data description (e.g. contents, size, number of records, ...)
Dataset includes 10 courses:
- Animez une communauté Twitter
- Apprenez à coder avec JavaScript
- Apprenez à créer votre site web avec HTML5 et CSS3
- Comprendre le Web
- Continuez avec Ruby on Rails
- Découvrez les bases de la gestion de projet
- Développez une application mobile multi-plateforme avec Ionic
- Prenez en main Bootstrap
- Programmez vos premiers montages avec Arduino
- Utilisez des API REST dans vos projets web
1 year of collected data (October 2015 to October 2016)
Total number of learners across all 10 courses: 203,541
Legal proceedings regarding the use of data
A signed agreement states the conditions of use of the data.
OpenClassrooms informs its students of the possible use of their data for research purposes.
For case study - Data properties (name, labs, universities, companies, ...)
OpenClassrooms, HUBBLE
Description of data collect
Additional fields
General Information
Purposes of analysis
The analysis is complete and can be tested for real-time validation by OpenClassrooms.
Person(s) in charge of the analysis (pre-processing and processing)
Alya Itani, Laurent Brisson
Which actors would be interested in the analysis and why?
The analysis can help:
1- MOOC providers improve their MOOCs through the early prediction and prevention of drop-out among their learners, thereby decreasing customer churn and the resulting losses.
2- Teachers uncover and address drop-out reasons related to course design and learners' behavior.
3- Learners receive an enhanced learning experience.
State of progress of the analysis scenario (e.g. pre-processing, processing, dashboards, ...)
Preprocessing is complete.
Feature selection is complete (extra features can be added in the future).
The analysis / predictive modeling phase is complete.
Evaluation on held-out data is complete.
Real-time evaluation has not yet been carried out.
Date or period of the analysis
March to July 2017
Description of learning analytics goals
Research Objective:
We focus on anticipating attrition among MOOC learners and finding the reasons behind it. The main reasons investigated are those related to course design and learners' behavior, in line with the requirements of the MOOC provider OpenClassrooms. The main objective is to help OpenClassrooms improve their MOOCs through the detection and corresponding prevention of drop-out among their learner population. Two critical business needs are identified in this context: first, the accurate detection of at-risk droppers, which allows sending automated motivational feedback to prevent learner drop-out; second, the investigation of possible drop-out reasons, which allows reporting to an intermediary, such as platform designers or teachers, who can then make the necessary personalized interventions. To meet these needs, we present a generic machine-learning-based drop-out prediction model that takes advantage of both predictive and explicative types of machine learning classifiers. This exploration makes three main contributions: (1) proposing a reliable drop-out prediction model that is generic in its modeling and evaluation and can be used to effectively spot at-risk droppers at a specified instant throughout the course; (2) offering a preliminary insight into the readability of explicative classifiers to determine possible reasons for drop-out; (3) introducing and testing the effect of advanced features related to the trajectories of learners' engagement with the course (backward jumps, frequent jumps, inactivity-time evolution). The findings of the experimental testing support these three claims.
How has the ethical dimension of the analysis been taken into account?
Anonymisation of user-ids
Pre-processing of data
Global description of pre-processing
Collecting and Cleaning the Dataset
The data was first collected in the form of CSV and JSON files through authorized access to a secured link provided by the MOOC provider. The data was then carefully examined to undergo the necessary structure verification and cleaning. Verifying the structure and soundness of the traces required knowledge exchange with the MOOC provider. An initial examination of the data was then needed to point out any corrupt or doubtful entries and fields within the traces. The cleaning performed on the explored dataset included the following:
• Neglecting missing users: eliminating all events of user_ids that do not exist in the main user table. Such entries correspond to events from deactivated or deleted user accounts.
• Considering most recent results: keeping the most recent entry as the only entry for each user. The occurrence of multiple entries is attributed to a possible system error during exercise sessions.
• Including missing timestamps: timestamp fields were missing from the data but were needed for the accuracy of the analysis, e.g. the created_at field of the result and exercise-session events.
• Removing duplicate entries: redundant duplicate entries in the Course Visualization table were cleaned out; in addition, some missing id values of anonymous visitors in the visualization events were handled.
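The first, second, and fourth cleaning steps above can be sketched in pandas (the tiny frames and column names here are illustrative stand-ins for the USER and USER_COURSE_RESULT tables):

```python
import pandas as pd

# Illustrative stand-ins for the USER and USER_COURSE_RESULT tables.
users = pd.DataFrame({"user_id": [1, 2]})
results = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],   # user 3 was deleted: no row in USER
    "user_score": [10, 15, 8, 5],
    "created_at": pd.to_datetime(
        ["2016-01-01", "2016-02-01", "2016-01-15", "2016-01-20"]),
})

# 1. Neglect missing users: drop events whose user_id is absent from USER.
results = results[results["user_id"].isin(users["user_id"])]

# 2. Consider most recent results: keep only the latest entry per user.
results = (results.sort_values("created_at")
                  .drop_duplicates("user_id", keep="last"))

# 3. Remove exact duplicate rows (as done for the visualisation table).
results = results.drop_duplicates()

print(results)
```

After these steps, user 3's orphaned event is gone and user 1 keeps only the February entry (score 15).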
Platforms or software used to pre-process data
Python on Jupyter Notebook.
Treatments of data
Overall description of the treatments used (e.g. make a list of the methods used)
Machine learning classifiers used:
• Gradient Boosting
• Random Forest
• Decision Trees
• Logistic regression
• K Nearest Neighbors
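The five classifiers listed above can be compared under stratified cross-validation with scikit-learn; a minimal sketch on synthetic data (the hyper-parameter tuning described later in this document is omitted here, and the dataset is a stand-in, not the OpenClassrooms data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 34-feature drop-out dataset (1 = drop-out).
X, y = make_classification(n_samples=500, n_features=34, random_state=0)

classifiers = {
    "Gradient Boosting":   GradientBoostingClassifier(random_state=0),
    "Random Forest":       RandomForestClassifier(random_state=0),
    "Decision Tree":       DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
}

# Stratified folds preserve the completion/drop-out class ratio per fold,
# which matters because drop-out datasets are typically imbalanced.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```

In practice each classifier would also be wrapped in a hyper-parameter search before the cross-validated comparison.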
Scripts of processing (download)
All code is shared on Hubble drive.
Additional fields
General information
Description of analysis results
In this exploration, two main business needs are pursued: first, achieving an accurate prediction at a specified instant of the course, established with the help of predictive classifiers; second, uncovering the underlying reasons for the predicted drop-out, established with the help of explicative classifiers.
The experimental findings can be summed up as follows. Fitting the models with proper stratified hyper-parameter tuning and cross-validation positively affected the predictive performance of the classifiers. Upon testing, all STV-fitted classifiers, with RF dominating, succeeded in delivering accurate predictions of at-risk learners at the end of the second chapter of the course, thus satisfying the MOOC provider's need to send automated motivational feedback to learners spotted as at risk. Moreover, we were able to attain the expected readability of predictions using the DT classifier, making it possible for MOOC providers and teachers to use such information to enhance their courses and personalize their interventions with at-risk learners. Finally, adding behavioral-indicator features to the model slightly enhanced its general predictive power and helped in better understanding the reasons for drop-out.
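The readability of the DT classifier mentioned above comes from the fact that a fitted tree's rules can be printed and inspected directly; a sketch on synthetic data (the feature names are hypothetical stand-ins for engagement indicators, not the study's actual features):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

# A shallow tree: capping the depth keeps the printed rules human-readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Hypothetical names standing in for the behavioral indicators.
rules = export_text(
    tree, feature_names=["inactivity_time", "backward_jumps",
                         "frequent_jumps", "exercise_score"])
print(rules)
```

Each printed branch reads as an if/then rule ending in a predicted class, which is the kind of output a teacher or platform designer can act on.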
Type of results produced (model, indicator, algorithms, ...)
An accurate and early drop-out predictive model.
How are the results acceptable from an ethical point of view? Or what are the perceived ethical problems?
Details
Indicator
Indicator's names
Refer to the Indicators Table
Indicators description
Refer to the Indicators Table
Number of dimensions
Each indicator, whether typical or behavioral, constitutes a set of features. The indicators in this case study form a combination of 34 mixed-type features (numerical and categorical) alongside one binary categorical target variable, with 0 denoting completion (the negative class) and 1 denoting drop-out (the positive class).
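Because the 34 features mix numerical and categorical types, the categorical columns need to be encoded before most of the listed classifiers can consume them. A sketch with pandas one-hot encoding (the column names are hypothetical examples, not the actual indicator names):

```python
import pandas as pd

# Hypothetical mix of numerical and categorical indicator columns.
df = pd.DataFrame({
    "n_sessions": [3, 7, 1],           # numerical feature
    "country":    ["fr", "ma", "fr"],  # categorical feature
    "dropout":    [1, 0, 1],           # binary target: 1 = drop-out
})

# One-hot encode the categorical columns; numerical columns pass through.
X = pd.get_dummies(df.drop(columns="dropout"))
y = df["dropout"]
print(X.columns.tolist())  # ['n_sessions', 'country_fr', 'country_ma']
```

Tree-based models such as RF and DT can also handle integer-coded categories, but one-hot encoding keeps the feature matrix usable by all five classifiers, including logistic regression and KNN.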
Value type (discrete, continue)
Refer to the Indicators Table
Mode of processing (real time, delayed)
Delayed (one year of recorded data)
Dashboards
Ethical Description