
Python: fast, vectorized nearest-neighbor search in 4D space

Programming Diary 2024-01-31 05:30:00

For each observation in X (20 in this example), I want to find its k = 3 nearest neighbors.

How can I make this fast enough to handle 3 to 4 million rows?

Is it possible to speed up the loop over the elements, perhaps with NumPy, Numba, or some other kind of vectorization?

A naive loop in Python:

import numpy as np
from sklearn.neighbors import KDTree

n_points = 20
d_dimensions = 4
k_neighbours = 3

rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
print(X)

tree = KDTree(X, leaf_size=2, metric='euclidean')

# one tree query per row
for element in X:
    print('********')
    print(element)

    # when simply using the first row
    #element = X[:1]
    #print(element)

    # potential optimization: query_radius https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
    dist, ind = tree.query([element], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)

    # indices of 3 closest neighbors
    print(ind)
    #[[0 9 1]] !! includes self (element that was searched for)

    print(dist)  # distances to 3 closest neighbors
    #[[0. 0.38559188 0.40997835]] !! includes self (element that was searched for)

    # actual returned elements for index:
    print(X[ind])

    ## after removing self
    print(X[ind][0][1:])

Ideally, the output is a pandas.DataFrame with the following structure:

lat_1,long_1,lat_2,long_2,neighbours_list

0.5488135,0.71518937,0.60276338,0.54488318,[[0.61209572 0.616934 0.94374808 0.6818203] [0.4236548 0.64589411 0.43758721 0.891773]]

Edit

For now, I have a pandas-based implementation:

df = df.dropna()  # sometimes only part of the tuple (either left or right) is defined

X = df[['lat1', 'long1', 'lat2', 'long2']]
tree = KDTree(X, leaf_size=4, metric='euclidean')
k_neighbours = 3

def neighbors_as_list(row, index, complete_list):
    # one tree query for this row; the first hit is the row itself, so it is dropped
    dist, ind = index.query([[row['lat1'], row['long1'], row['lat2'], row['long2']]], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
    return complete_list.values[ind][0][1:]

df['neighbors'] = df.apply(neighbors_as_list, index=tree, complete_list=X, axis=1)
df.head()

But this is very slow.

Edit 2

Here is a pandas version:

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
#n_points = 4_000_000
n_points = 20
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))

df = pd.DataFrame(X)
df = df.reset_index(drop=False)
df.columns = ['id_str', 'lat_1', 'long_1', 'lat_2', 'long_2']
df.id_str = df.id_str.astype(object)
display(df.head())  # display() is available in Jupyter/IPython

tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
# workers=-1 uses all CPU cores; SciPy versions before 1.6 call this argument n_jobs
dist, ind = tree.query(X, k=k_neighbours, workers=-1)
display(dist)

ind_out = ind[:, 1:]  # drop the first column (each point is its own nearest neighbor)

print(df[['lat_1', 'long_1', 'lat_2', 'long_2']].shape)
print(X[ind_out].shape)
X[ind_out]

# fails with
# AssertionError: Shape of new values must be compatible with manager shape
df['neighbors'] = X[ind_out]

But it fails because I cannot assign the result to a new column.
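The failure is consistent with X[ind_out] being a 3-D array of shape (n_points, 2, 4), which pandas cannot store as a single column. A minimal sketch of one possible workaround, assuming it is acceptable to keep one (2, 4) array per row in an object column, is to wrap the result in a list before assigning it. The query runs single-threaded here to keep the sketch short, and the name cols is only illustrative:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points, d_dimensions, k_neighbours = 20, 4, 3
X = rng.random_sample((n_points, d_dimensions))

cols = ['lat_1', 'long_1', 'lat_2', 'long_2']  # illustrative column list
df = pd.DataFrame(X, columns=cols)

tree = cKDTree(X)
_, ind = tree.query(X, k=k_neighbours)  # one batched query for all rows
ind_out = ind[:, 1:]                    # drop column 0, the query point itself
coords_out = X[ind_out]                 # shape (20, 2, 4)

# A list of 20 arrays is stored as one object per row,
# whereas the raw 3-D array cannot be assigned to a single column.
df['neighbors'] = list(coords_out)
print(df.head())
```

Each cell of the neighbors column then holds a (2, 4) array with the coordinates of the two nearest neighbours. Object columns are convenient but not memory-efficient, so at several million rows it may be preferable to keep coords_out as a separate NumPy array.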

Solution

You could use SciPy's cKDTree.

Example

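import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points = 4_000_000
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))

tree = cKDTree(X)
#%timeit tree = cKDTree(X)
#3.74 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#%%timeit
# workers=-1 uses all CPU cores; SciPy versions before 1.6 call this argument n_jobs
_, ind = tree.query(X, k=k_neighbours, workers=-1)

# drop the first column (the query point itself)
#shape=(4000000, 2)
ind_out = ind[:, 1:]

# coordinates of the remaining neighbors
#shape=(4000000, 2, 4)
coords_out = X[ind_out]
#7.13 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)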

About 11 s in total (3.74 s to build the tree plus 7.13 s to query it) for a problem of this size is quite good.
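A large part of the gain likely comes from passing the whole array to a single query call instead of looping in Python. As a rough sketch, and not something measured in the answer, the scikit-learn KDTree from the question accepts a batched query in the same way. The 20-point setup from the question is reused, and neighbor_idx, neighbor_dist and neighbor_coords are illustrative names:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
n_points, d_dimensions, k_neighbours = 20, 4, 3
X = rng.random_sample((n_points, d_dimensions))

tree = KDTree(X, leaf_size=2, metric='euclidean')

# Query all rows at once instead of calling tree.query([element]) in a loop.
dist, ind = tree.query(X, k=k_neighbours)

# Column 0 is the query point itself (distance 0), so keep columns 1 onward.
neighbor_idx = ind[:, 1:]        # shape (20, 2)
neighbor_dist = dist[:, 1:]      # shape (20, 2)
neighbor_coords = X[neighbor_idx]  # shape (20, 2, 4)

print(neighbor_idx[0], neighbor_dist[0])
```

For the first row this should reproduce the neighbours 9 and 1 printed by the loop in the question. Which of the two libraries is faster at 4 million rows is not measured here.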
