64th ISI World Statistics Congress - Ottawa, Canada

Extracting meaningful information from web data on real estate – challenges and experiences from the Web Intelligence Network

Author

Klaudia Peszat

Co-author

  • Dominik Dabrowski

Conference

64th ISI World Statistics Congress - Ottawa, Canada

Format: IPS Abstract

Keywords: big data, experimental, official statistics

Session: IPS 200 - Challenges of Natural Language Processing techniques in official statistics

Tuesday 18 July 10 a.m. - noon (Canada/Eastern)

Abstract

This paper presents the challenges of applying Natural Language Processing (NLP) to web data from online real estate offers in order to extract information that could complement official statistics. The study is part of an experimental stream of the ESSnet Web Intelligence Network project, which aims to explore the possibility of producing new statistics and augmenting existing ones via the European platform, the Web Intelligence Hub (WIH).
The information acquired from web data may be used to monitor ongoing changes in the real estate market in a timelier manner and to provide new indicators, especially for observing trend lines and giving a better view of the condition of the offered properties. Online real estate sales and rental offers contain a wide range of additional information, from the characteristics of the building, through the area surrounding the property, to the amenities available in the property. This information may be used for imputation or for constructing new indicators, making it possible to observe a greater variety of offered properties. This raises the problem of extracting valuable information from highly unstructured data and providing adequate input to machine learning models that automatically classify a large number of offers. Natural Language Processing techniques appear to be particularly useful here. However, a number of challenges and methodological issues related to their application have to be investigated and considered in great detail, such as the type of method used for building the model (supervised or unsupervised), re-training, and manual classification, to name just a few.