
Shorter-is-Better:

Venue Category Estimation from Micro-Video

Abstract

Nowadays, it has become convenient to capture images and videos on mobile devices and associate them with GPS tags. Such a hybrid data structure can benefit a wide variety of potential multimedia applications, such as location recognition, landmark search, augmented reality, and commercial recommendation, and it has therefore attracted great attention from the multimedia community. Meanwhile, late 2012 saw a dramatic shift in the way Internet users digest videos: micro-videos spread rapidly across various online flagship platforms, such as Vine. The emergence of vast amounts of micro-video data gives the multimedia community a strong reason to investigate venue category estimation for micro-videos.

Dataset

We crawled the micro-videos from Vine through its public API (https://github.com/davoclavo/vinepy). In particular, we first manually chose a small set of active Vine users as our seed users. We then adopted a breadth-first strategy to expand our user set by gathering their followers, terminating the expansion after three layers. For each collected user, we crawled his/her published videos, video descriptions, and venue information if available. In this way, we harvested 2 million micro-videos, of which only about 24,000 contain Foursquare check-in information. After removing duplicate venue IDs, we further expanded our video set by crawling all videos associated with each venue ID with the help of the vinepy API. This eventually yielded a dataset of 276,264 videos distributed over 442 Foursquare venue categories. Each venue ID was mapped to a venue category via the Foursquare API (https://developer.foursquare.com/categorytree), which serves as the ground truth. As shown in Figure 1, 99.8% of the videos are shorter than 7 seconds.
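For illustration only, below is a minimal Python sketch of the breadth-first user expansion described above; get_followers is a hypothetical placeholder for the corresponding vinepy call, not the actual crawler we ran.

# Minimal sketch of the breadth-first user expansion (three layers).
# get_followers(user) is a hypothetical stand-in for a vinepy API call;
# seed_users is the hand-picked list of active users.
from collections import deque

def crawl_users(seed_users, get_followers, max_depth=3):
    """Expand the seed set breadth-first, stopping after max_depth layers."""
    visited = set(seed_users)
    queue = deque((u, 0) for u in seed_users)
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for follower in get_followers(user):
            if follower not in visited:
                visited.add(follower)
                queue.append((follower, depth + 1))
    return visited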

Foursquare organizes venue categories into a four-layer hierarchical structure (Figure 2, https://developer.foursquare.com/categorytree; a high-quality SVG of the category tree can be downloaded here), with 341, 312, and 52 leaf nodes in the second, third, and fourth layers, respectively. The top layer of this structure contains ten non-leaf nodes (coarse venue categories). To visualize the coverage and representativeness of our collected micro-videos, we plotted and compared the distribution curves over the number of leaf categories between our dataset and the original structure, as shown in Figure 3.
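The leaf counts per top-level category (the quantity plotted in Figure 3) can be reproduced from the released hierarchy.json with a short recursive walk. The sketch below assumes each node carries a "name" and a (possibly empty) "categories" list of children, mirroring the Foursquare category tree; inspect the file if its layout differs.

# Count leaf nodes under each top-level Foursquare category.
import json

def count_leaves(node):
    children = node.get("categories", [])
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)

with open("hierarchy.json") as f:
    data = json.load(f)

# Assumption: the top-level categories are stored either directly as a list
# or under a "categories" key, as in the Foursquare category tree.
top_level = data if isinstance(data, list) else data.get("categories", [])
for top in top_level:
    print(top["name"], count_leaves(top))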

 

Figure 1: Duration distribution in our dataset

 

Figure 2: The hierarchical structure of the venue categories.

 

Figure 3: Top-level venue category distribution in terms of the number of their leaf nodes

 


Figure 4: Some examples of micro-videos

 

Multi-modality Feature Extraction

We extracted a rich set of features from the visual, acoustic, and textual modalities.

 

1. Visual Modality:

 

 

We employed the AlexNet model to extract visual features through the publicly available Caffe framework. The model was pre-trained on 1.2 million clean images from ILSVRC-2012 (http://www.image-net.org/challenges/LSVRC/2012/) and hence provides a robust initialization for recognizing semantics. Before feature extraction, we first extracted the key frames of each micro-video using OpenCV (http://opencv.org/), and then applied AlexNet to obtain CNN features for each frame. Following that, we applied mean pooling over all key frames of a video, generating a single 4,096-dimensional vector for each micro-video.
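A minimal sketch of this per-video pooling step is given below; it uses uniform frame sampling in place of our key-frame detector, and alexnet_fc7 is a hypothetical wrapper around the Caffe forward pass that returns the 4,096-dimensional fc7 activation of a frame.

# Sample frames with OpenCV and mean-pool per-frame AlexNet fc7 features.
import cv2
import numpy as np

def video_visual_feature(video_path, alexnet_fc7, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Uniform sampling stands in for key-frame detection here.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    feats = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            feats.append(alexnet_fc7(frame))   # 4,096-d vector per frame
    cap.release()
    # Mean pooling over the sampled frames gives one 4,096-d video vector.
    return np.mean(feats, axis=0) if feats else np.zeros(4096)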

 

2. Acoustic Modality:

 

 

To extract the acoustic features, we first separated the audio tracks from the micro-videos with the help of FFmpeg (https://www.ffmpeg.org/). The audio tracks were then transformed into a uniform format: 22,050 Hz, 16-bit, mono-channel, pulse-code modulation signals. We then computed a spectrogram with a 46 ms window and 50% overlap via librosa (https://github.com/bmcfee/librosa). After obtaining the shallow representation of each audio track as a 512-dimensional feature vector, we adopted Theano to learn deep features. In particular, a stacked Denoising AutoEncoder (DAE) was employed, ultimately yielding 200-dimensional acoustic features for each micro-video.
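The shallow spectral representation can be sketched as follows with librosa; at 22,050 Hz a 1,024-sample window is roughly 46 ms, and a hop of half the window gives 50% overlap. The exact pooling down to 512 dimensions and the Theano stacked denoising autoencoder are omitted here.

# Magnitude spectrogram of an FFmpeg-separated audio track.
import librosa
import numpy as np

def audio_spectrogram_feature(audio_path, sr=22050, n_fft=1024):
    # n_fft = 1024 samples at 22,050 Hz is roughly a 46 ms window;
    # hop_length = n_fft // 2 gives 50% overlap.
    y, _ = librosa.load(audio_path, sr=sr, mono=True)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2))
    # Average over time frames to obtain a fixed-length spectral vector.
    return spec.mean(axis=1)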

3. Textual Modality:

 

 

We utilized the Paragraph Vector method, which has been proven effective in alleviating the semantic problem of word sparseness. In particular, we first eliminated non-English characters and then removed stop words. We then employed the Sentence2Vector tool (https://github.com/klb3713/sentence2vec) to extract textual features, finally obtaining 100-dimensional features for each micro-video description.
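As a rough stand-in for the Sentence2Vector tool, the sketch below uses gensim's Doc2Vec implementation of Paragraph Vector with the same 100-dimensional setting; the stop-word list and cleaning rules are illustrative only, not the exact preprocessing we used.

# Stand-in sketch for the textual feature step using gensim (>= 4.0) Doc2Vec,
# which, like Sentence2Vector, implements the Paragraph Vector method.
import re
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative stop-word list; a standard English list would be used in practice.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "at", "to", "is"}

def clean(description):
    # Keep ASCII words only (drops non-English characters), remove stop words.
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def train_description_vectors(descriptions):
    docs = [TaggedDocument(clean(d), [i]) for i, d in enumerate(descriptions)]
    model = Doc2Vec(docs, vector_size=100, min_count=2, epochs=20)
    return [model.dv[i] for i in range(len(docs))]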

Note: this is a temporary page; the permanent URL is:

http://github.com/lengyuexiang/Venue-Category-Estimation-from-Micro-videos

Download

Our code is available:

         1. Python version: TRUMMAN_MODEL.py

              (dependencies: Python 3.4; NumPy 1.1; scikit-learn 0.17)

         2. Matlab version: TRUMMAN_MODEL.rar

             (dependency: MATLAB R2015a)

 

The code of the baselines is available:

         1. SRMTL, RMTL:    http://www.public.asu.edu/~jye02/Software/MALSAR/

         2. regMVMT:  please contact jtzhang@gmail.com.

         3. MvDA:  http://vipl.ict.ac.cn/resources/codes

 

Our dataset is available (see the loading sketch below the list):

         1.  Raw features + labels (dataset.h5); note that you need to complete the missing data yourself.

         2.  Tree structure (Python version: tree.pkl;   Matlab version: tree.mat)

         3.  Group weights (Python version: tree_group.pkl;   Matlab version: tree_group.mat)

         4.  Textual descriptions (description.text)

         5.  Hierarchical Foursquare categories (hierarchy.json). You can learn the group structure from this file.
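A minimal loading sketch for the released Python-side files is given below; the group and key names inside dataset.h5 are not documented here, so inspect them before use.

# Load the released files with h5py and pickle.
import pickle
import h5py

with h5py.File("dataset.h5", "r") as f:
    print(list(f.keys()))            # inspect the stored feature/label datasets

with open("tree.pkl", "rb") as f:
    tree = pickle.load(f)            # hierarchical venue-category structure

with open("tree_group.pkl", "rb") as f:
    tree_group = pickle.load(f)      # group weights derived from the tree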


  
