ABSTRACT
Many people take up genealogy because they are excited to discover their ancestors. However, finding ancestors is often laborious, since it involves comparing a large number of historical birth records and manually matching the people mentioned in them. We have developed AncestryAI, an open-source tool for automatically linking historical records and exploring the resulting family trees. We introduce a record-linkage method for computing the probabilities of candidate matches, which allows users either to identify the next ancestor directly or to narrow down the search. We also propose an efficient layout algorithm for drawing and navigating genealogical graphs. The tool is additionally used to crowdsource training and evaluation data to improve the matching algorithm. Our objective is to build a large genealogical graph that could be used to address questions in computational social science, genetics, and evolutionary studies. The tool is openly available at: http://emalmi.kapsi.fi/ancestryai/.
Index Terms
- AncestryAI: A Tool for Exploring Computationally Inferred Family Trees