Information Extraction from TV Series Scripts for Uptake Prediction

Wang, Junshu
Nand, Parma
Naeem, Muhammad Asif
Degree name
Master of Computer and Information Sciences
Auckland University of Technology

The script of a movie, or of an episode of a television series, describes the setting, the storyline, and the scene changes. It also details the movements, actions, non-verbal expressions, and dialogue of the characters. Potential investors assess the script and, if they judge it suitable, decide to commit funds and other resources to produce the actual movie or television series. This act of approving the project is known as green-lighting.

Many studies have built models to predict the success of movies. However, most of them rely on factors that become known only after the green-lighting decision, or after the product has been released. Only a few studies have focused on predictive models based on pre-greenlighting factors, that is, factors available before the green-lighting decision is made. Fewer still forecast the performance of television series from pre-greenlighting factors.

This study aims to extract features from the scripts of pilot episodes, the first episodes of television series, and to use these features to construct predictive models of series uptake. Three data sources were employed: the IMDB, the OpenSubtitles2016 corpus, and television series scripts retrieved from multiple websites. The scripts were parsed and their structures analysed. Features were then extracted and data matrices generated, and these were used to train classification algorithms and build the predictive models. The trained models were then used to predict uptake. However, the results were not as compelling as expected. The present research is compared with previous studies on the same topic, the evaluation results are discussed, and suggestions for future work are given.
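
As a rough illustration of the parse-then-extract step described above, the following Python sketch turns a plain-text script into a few structural features and stacks several scripts into a data matrix. It is not the method used in the thesis. The formatting conventions (scene headings starting with INT. or EXT., speaker cues as short upper-case lines), the feature names, and the helper names `extract_features` and `to_matrix` are all assumptions made for this example.

```python
import re

# Assumed conventions (hypothetical, not taken from the thesis):
# scene headings start with INT. or EXT., and a speaker cue is a
# short line written entirely in upper case.
SCENE_RE = re.compile(r"^(INT\.|EXT\.)")
CUE_RE = re.compile(r"^[A-Z][A-Z .'\-]*$")

# Fixed column order for the data matrix (illustrative feature set).
FEATURE_NAMES = ["n_scenes", "n_speakers", "dialogue_words", "top_speaker_share"]


def extract_features(script_text):
    """Return a dict of simple structural features for one script."""
    scenes = 0
    words_by_speaker = {}
    speaker = None
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line:
            speaker = None  # a blank line ends the current speech
        elif SCENE_RE.match(line):
            scenes += 1
            speaker = None
        elif CUE_RE.match(line) and len(line.split()) <= 3:
            speaker = line
            words_by_speaker.setdefault(speaker, 0)
        elif speaker is not None:
            # Lines following a cue are counted as that speaker's dialogue.
            words_by_speaker[speaker] += len(line.split())
        # Any other line is treated as action text and ignored.
    total = sum(words_by_speaker.values())
    top_share = max(words_by_speaker.values()) / total if total else 0.0
    return {
        "n_scenes": scenes,
        "n_speakers": len(words_by_speaker),
        "dialogue_words": total,
        "top_speaker_share": top_share,
    }


def to_matrix(scripts):
    """Stack per-script feature dicts into rows with a fixed column order."""
    rows = []
    for text in scripts:
        features = extract_features(text)
        rows.append([features[name] for name in FEATURE_NAMES])
    return rows


if __name__ == "__main__":
    sample = "INT. KITCHEN - DAY\n\nANNA\nDid you sleep at all?\n"
    print(to_matrix([sample]))
```

A matrix of this shape, paired with one uptake label per script, is the kind of input a standard classification algorithm expects. Real scripts vary widely in layout, so a practical parser would need per-source handling well beyond these two patterns.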

Information extraction, Feature extraction, NLP, Prediction, TV Series Scripts, Distributed representation, Dependency parsing