Research article · DOI: 10.1145/3347450.3357655

Visually Grounded Language Learning for Robot Navigation

Published: 15 October 2019

ABSTRACT

We present an end-to-end deep learning model for robot navigation from raw visual pixel input and natural-language instructions. The proposed model is an LSTM-based sequence-to-sequence neural network architecture with attention, trained on instruction-perception data samples collected in a synthetic environment. We conduct experiments on the SAIL dataset, which we reconstruct in 3D so as to generate the 2D images associated with the data. Our experiments show that the performance of our model is on par with the state of the art, even though it learns navigational language end-to-end from raw visual data.
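The abstract alone does not fix implementation details, so the following is a purely illustrative sketch, not the authors' code: a PyTorch-style LSTM encoder-decoder with attention that maps a tokenized instruction and a sequence of per-step image features to navigation actions. All module names, feature dimensions, and the choice of a Luong-style attention mechanism are assumptions made here for illustration.

```python
# Hypothetical sketch of the kind of architecture the abstract describes
# (LSTM sequence-to-sequence with attention over the instruction).
# Not the authors' code; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class InstructionToActionModel(nn.Module):
    def __init__(self, vocab_size, n_actions, img_feat_dim=512, hidden=256, emb=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        # Bidirectional LSTM encodes the natural-language instruction.
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        # Decoder LSTM consumes the current visual observation (e.g. CNN
        # features of the raw pixels) together with the attended instruction.
        self.decoder = nn.LSTMCell(img_feat_dim + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.action_head = nn.Linear(hidden + 2 * hidden, n_actions)

    def forward(self, instr_tokens, img_feats):
        # instr_tokens: (B, T_text)   img_feats: (B, T_steps, img_feat_dim)
        enc_out, _ = self.encoder(self.embed(instr_tokens))          # (B, T_text, 2H)
        B, T_steps, _ = img_feats.shape
        h = enc_out.new_zeros(B, self.decoder.hidden_size)
        c = enc_out.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for t in range(T_steps):
            # Luong-style attention over the encoded instruction.
            scores = torch.bmm(enc_out, self.attn_proj(h).unsqueeze(2))  # (B, T_text, 1)
            weights = torch.softmax(scores, dim=1)
            context = (weights * enc_out).sum(dim=1)                     # (B, 2H)
            step_in = torch.cat([img_feats[:, t], context], dim=1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.action_head(torch.cat([h, context], dim=1)))
        return torch.stack(logits, dim=1)                                # (B, T_steps, n_actions)
```

In the paper's setting the image features would presumably come from a CNN applied to the rendered 2D views of the reconstructed environment, and training would minimize a per-step action classification loss over the collected instruction-perception samples; both choices are assumptions of this sketch rather than details stated in the abstract.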


Published in
MULEA '19: 1st International Workshop on Multimodal Understanding and Learning for Embodied Applications
October 2019, 65 pages
ISBN: 9781450369183
DOI: 10.1145/3347450
Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
